Estimating the number of distinct words in a given language

edwin · May 15, 2014, 3:40am

I have almost 23000 Russian LingQs but every time I create a lesson, I see a lot of blue words. It is kind of frustrating. So I wonder how many distinct words there are in the entire Russian language.

Here is my attempt to find out the answer.

I remember back in high school, I learned of a method to estimate the number of fish in a pond. You capture some fish (sample size = s1), mark them and release them. Then you capture some fish again (sample size = s2) and count the number of marked fish (m). The total number of fish in the pond would be:

(s1 x s2) / m

So I tried to use this method to estimate the number of distinct words in Russian. I selected a random Russian article of medium length in an unfamiliar topic. I created a lesson out of it, and counted the number of new words vs LingQed + known words. Known words are not traceable but fortunately I didn’t have a lot of them.

I repeated this 2 more times, and here are the results:

Current total LingQed words: 22966

Article 1:
New words: 191
LingQed words = 303

(191+303) x 22966 / 303 = 37443

Article 2:
New words: 238
LingQed words = 432

(238+432) x 22966 / 432 = 35619

Article 3:
New words: 227
LingQed words = 392

(227+392) x 22966 / 392 = 36265

So I estimated that the total number of distinct words in the Russian language is about 36442.
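For anyone who wants to rerun the arithmetic, here is a minimal Python sketch of the calculation above. The function and variable names are my own (hypothetical); the mapping is s1 = total LingQed words, s2 = new + LingQed words in the article, m = LingQed words in the article.

```python
def lincoln_petersen(marked_total, sample_size, marked_in_sample):
    """Capture-recapture estimate N = (s1 x s2) / m."""
    return marked_total * sample_size / marked_in_sample


LINGQED_TOTAL = 22966  # s1: words already "marked" (LingQed)

# (new words, LingQed words) for each article, as listed above
articles = [(191, 303), (238, 432), (227, 392)]

estimates = [
    lincoln_petersen(LINGQED_TOTAL, new + lingqed, lingqed)
    for new, lingqed in articles
]
average = sum(estimates) / len(estimates)

for number, estimate in enumerate(estimates, start=1):
    print(f"Article {number}: {estimate:.0f}")
print(f"Average: {average:.0f}")
```

Swapping in your own LingQ total and per-article counts should give the same kind of estimate for another language.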

Any comment on my methodology?

I also wonder if anyone else wants to do a similar exercise to confirm this estimate, or to do estimates on other languages.

steve · May 15, 2014, 4:57am

I have 86,000 known words and 45,000 LingQs in Russian. I still encounter unknown words.

Back to the drawing board.

Note that we count every form of a word as a separate word.

ColinJohnstonov · May 15, 2014, 8:38am

I would guess that your method might work for the fish and not for the words because in the fish example, the probability of being picked is evenly distributed over all of the fish, whereas in the words example, the probability of a word appearing in a given lesson is very different for different words. This will mean you will always underestimate the number of words using this method.

Imagine in the fish example that certain fish are much more likely to be caught than others. Then your value of m will be much larger than you would otherwise expect, since the fish you caught the first time are more likely to be caught again the second time. A larger m gives a smaller estimate of the total.

ColinJohnstonov · May 15, 2014, 9:01am

Ok, so here is how the method works. You have a population of N fish in the pond, and you pick s1 of them, and then you pick s2 of them. The probability of a given fish being in the first sample is s1/N. Similarly, the probability of a given fish being in the second sample is s2/N. Therefore, assuming the two samples are independent, the probability of a given fish being in both samples is (s1/N)x(s2/N), which is

s1 x s2 / N^2

The number of fish, m, that you would expect to appear in both samples is the probability of a given fish being in both samples multiplied by the number of fish. This is not necessarily the number that will actually be in both samples, since by chance you could pick the same few fish both times, or completely different fish both times, but it is an estimate of how many you would expect.

m = (s1 x s2 / N^2) x N

Which, rearranged, gives you the formula proposed at the beginning.

N = s1 x s2 / m

But notice that I have made a certain assumption. When I first calculated the probability of a given fish being in the first and second samples, I got s1/N and s2/N. This is only true if the probabilities are equally distributed over all of the fish, i.e. if every fish is as likely to be picked as any other.

Your method does not work because there are vast differences in the probabilities of a given word appearing in a lesson, and therefore, if you have N words in the language, and s words in the lesson, the probability of a given word appearing in a lesson will not be s/N, but something very different.
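To illustrate the point, here is a small simulation sketch. Everything in it is an assumption made for illustration: a toy vocabulary of 20,000 words, word probabilities proportional to 1/rank (a Zipf-like distribution), and "knowing" a word modelled as having drawn it at least once. It runs the same capture-recapture experiment twice, once with equally likely words and once with the skewed distribution.

```python
import random
from itertools import accumulate


def draw_distinct(rng, cum_weights, target):
    """Draw word ids (with replacement) until `target` distinct ids are seen."""
    ids = range(len(cum_weights))
    seen = set()
    while len(seen) < target:
        seen.update(rng.choices(ids, cum_weights=cum_weights, k=100))
    return seen


def recapture_estimate(rng, weights, s1, s2):
    """Run one capture-recapture experiment and return s1 x s2 / m."""
    cum_weights = list(accumulate(weights))
    known = draw_distinct(rng, cum_weights, s1)    # the "marked" words
    article = draw_distinct(rng, cum_weights, s2)  # words in a new lesson
    m = len(known & article)                       # marked words seen again
    return len(known) * len(article) / m


N = 20000  # true vocabulary size in this toy model (made up)
rng = random.Random(0)

uniform = [1.0] * N                              # every word equally likely
zipf = [1.0 / rank for rank in range(1, N + 1)]  # a few words dominate

uniform_est = recapture_estimate(rng, uniform, 4000, 500)
zipf_est = recapture_estimate(rng, zipf, 4000, 500)

print(f"true N:           {N}")
print(f"uniform estimate: {uniform_est:.0f}")
print(f"skewed estimate:  {zipf_est:.0f}")
```

With equal probabilities the estimate should land reasonably close to the true N, while with the skewed distribution it should come out far too low, because the frequent words turn up in both samples and inflate m.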

obordal · May 15, 2014, 10:21am

Hehe, blue text is what I see in every Finnish lesson on LingQ. I guess it must be the same for Russian texts, too.

Now, if you are really interested in frequencies and words, here is an excellent source: The frequency dictionary for Russian

The site gives you lists of lemmas (that is, words in their basic uninflected form). To count the number of actual words, you can safely multiply at least by 10 (we have 6 grammar cases, singular and plural, present and past tense, etc). Of course, many forms coincide or are not applicable, but this is compensated by the language’s rich inventory of suffixes & prefixes.

As 5000 lemmas comprise about 82% of word forms in texts, and each lemma has roughly 10 forms (5000 x 10 = 50,000), we can safely assume that if the number of linked words is about 50 thousand, then you know about 80% of the language’s word stock.
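The "N items cover X% of texts" figure is simple to compute yourself if you have word counts. Below is a minimal sketch; the toy English sentence stands in for a real corpus and the function name is hypothetical. With a real lemma frequency list you would pass its counts instead.

```python
from collections import Counter


def coverage(counts, top_k):
    """Share of all tokens covered by the `top_k` most frequent items."""
    total = sum(counts.values())
    covered = sum(count for _, count in counts.most_common(top_k))
    return covered / total


# Toy text standing in for a real corpus (hypothetical data).
tokens = "the cat and the dog and the bird saw the cat".split()
counts = Counter(tokens)

print(f"top 1 covers {coverage(counts, 1):.0%} of tokens")
print(f"top 3 covers {coverage(counts, 3):.0%} of tokens")
```

Note that this measures coverage of running text, which is not quite the same thing as the share of distinct words you know.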

Of course, there are various “dictionaries” out there that give you lists of words that have ever occurred in texts. I am currently looking into one of these:

name : russian.dic
language : russian
description : слова, встречающиеся в русском языке (words occurring in the Russian language)
version : 1.20 / 031205
words : 296790
size : 2865575
CR
compilation : Inu Yasha

I am not sure what “CR” means, but the number of words is rather big. Yet, if you look inside the file, the words are pretty weird:

аа - ?
аав - name?
аави-хасан - name
ааво - (Finnish) name?
ааза - name Ааз in genitive case?
аазе - same name in prepositional case?
аазмандиус - strangely reminds of Ozymandias
аазом - name Ааз in instrumental case?
аазу - same name in dative case?
ааленского - name
ааленском - name
аало - name
аалто - Finnish name
аалтонен - Finnish surname
аамир - name
аамых - Turkish name?
ааре - name
аарне - name
аарон - Jewish name
аарон-сауль - name
ааронас - Latvian name?
аароновщина - a term specific to the history of old Russian church; surely not very widely used
аатм - I guess this is AATM - away at the moment
аатолий - mistyped name Анатолий
аах - either some fictional name, or maybe an emphatic way of writing “а-а-ах” (oooh, ouch)
ааюн - some name
аб-нафик - Arab name
аб-хак - (Arab?) name
аба - reminds me of some Turkish name or honorific suffix
абабакар - this, and
абабакир - names of Turkish or Arab origin.

As you can see, these are mostly names and rarely used terms.

nobody · May 15, 2014, 1:02pm

The Oxford English Dictionary gives descriptions for 750,000 words for English (so says Wikipedia). I doubt Russian has fewer than that.

Of course, there are questions about what counts as a “Russian” (or “English” or whatever language) word, whether people still use it orally, whether it appears in literature in the past 100, 200 years, etc. It’s going to be arbitrary to some extent. Besides, words that are not very frequent to some people will be frequent for others, depending on what they read, their field of work, the people they talk to, and so on.

For the purposes of language learning, however, there will come a point where the number is irrelevant because you will understand most words you hear or read for the first time by context.

edwin · May 15, 2014, 3:04pm

Steve: “I have 86,000 known words and 45,000 LingQs in Russian. I still encounter unknown words.”

That’s discouraging…

steve · May 15, 2014, 3:57pm

Why is it discouraging? I know so many words in Russian that I follow all news programs, read books, watch movies, without major difficulty. I have been enjoying Russian for years, and slowly increasing known word count. As rafael says, it is what you know that matters, not what you don’t know.

edwin · May 15, 2014, 6:19pm

It is discouraging to keep seeing so many blue words whenever I open a new lesson. I feel like the previously marked LingQs have gone into a black hole. Just a personal feeling anyway.

obordal · May 15, 2014, 7:05pm

It would be interesting to employ some sort of clustering scheme to lingqs… For example, if a word is not linked yet, but there are other links to words with the same root, then show that the word is “partially known”. Thus, the more words you know, the more “partial matches” will appear in texts.

But of course this logic is much more difficult to implement than it sounds.
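Just to make the idea concrete, here is a very crude sketch. It treats a shared prefix of at least four letters as a stand-in for "same root", which is an assumption that breaks quickly in real languages (prefixed verbs, vowel changes, short roots). All names are hypothetical and this says nothing about how LingQ works internally.

```python
def shared_prefix_len(a, b):
    """Length of the common leading substring of two words."""
    length = 0
    for x, y in zip(a, b):
        if x != y:
            break
        length += 1
    return length


def classify(word, linked, min_prefix=4):
    """Return 'known', 'partially known' or 'new' for a word.

    Crude heuristic: a shared prefix of at least `min_prefix` letters
    stands in for "same root". Real roots are not always prefixes.
    """
    if word in linked:
        return "known"
    if any(shared_prefix_len(word, w) >= min_prefix for w in linked):
        return "partially known"
    return "new"


linked = {"activate", "reading"}
for word in ["activate", "activated", "reads", "table"]:
    print(word, "->", classify(word, linked))
```

A real implementation would need proper morphological analysis rather than prefix matching.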

steve · May 15, 2014, 7:22pm

We are looking at introducing some kind of connection between related words. It is on our long list, and we have some ideas on how to do this. How quickly we get there remains to be seen.

In my case, as a rapid LingQer, I find the frequent re-LingQing of related words to be beneficial. I usually go through a text LingQing as fast as I can. (This has become a lot faster now). Then I read the texts later with only yellow words, often on my iPad.

Re-LingQing forms of words that I have met before gives me additional exposure to these words and helps me learn them. I don’t mind having LingQs with 8 forms of the same word. I sometimes batch-move them to known in the Vocabulary section, where they often appear in a row, in alphabetical order.

But each person studies differently, I realize.

kcb · May 15, 2014, 8:06pm

In an ideal world we’d have a way to mark a “word family” as known if we felt like it and be done with all the forms of a word. Some people wouldn’t like the multiple meanings that would be ‘lost’, but, as someone with close to 80,000 known words in Spanish, I can say it would have been well worth it for me.

The majority of my blue words are word forms of words that I recognize instantly, some are proper names, and barely any are legitimately new words. It’s a matter of diminishing returns at this point.

I understand that such a word family obliterating function is likely difficult to implement, but we can dream. In my last text of 9000+ words, I had 83 blue words, of which 44 were words I marked as known, 35 were proper names I ignored, and a grand total of 4 I marked as lingqs. Often there are fewer than four in such a text. Maybe my Spanish situation just isn’t suited to reading on lingq any more. Of course, I still have my other languages (which would also benefit from such a function).

ColinJohnstonov · May 15, 2014, 8:27pm

I think for German, a continual annoyance is coming across compound words. They make up most of the new words I come across and consist almost always of two or more words that I already have set to known or have already LingQed. Saving them is a pain in the Bratwurst.
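In principle, checking whether a new word is just a chain of already known words is a standard "word break" problem. Here is a minimal sketch under strong assumptions: the known set is a toy example, everything is lowercased, and linking letters between parts (such as the German "s" or "n") are ignored, so many real compounds would not be caught.

```python
def split_compound(word, known):
    """Return a list of known words that concatenate to `word`, or None.

    Standard word-break dynamic programming; ignores linking letters
    such as the German "s" or "n" between parts.
    """
    # best[i] holds one split of word[:i], or None if there is none
    best = [None] * (len(word) + 1)
    best[0] = []
    for end in range(1, len(word) + 1):
        for start in range(end):
            if best[start] is not None and word[start:end] in known:
                best[end] = best[start] + [word[start:end]]
                break
    return best[len(word)]


known = {"brat", "wurst", "haus", "tür"}  # toy "known words" list
print(split_compound("bratwurst", known))
print(split_compound("haustür", known))
print(split_compound("kühlschrank", known))
```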

obordal · May 15, 2014, 8:40pm

“In an ideal world we’d have a way to mark a “word family” as known if we felt like it and be done with all the forms of a word.”

Here lies the problem. In real languages, there are word forms that coincide, and it may cause trouble.

I remember a story about early attempts at analyzing word families in the Russian language. The guys wrote a computer program that would parse texts, extract individual words, and reduce inflected forms to their dictionary entry. The result was a list of words sorted by their frequency (the number of occurrences in the text).

So, the guys fed a huge electronic library to their program and happily waited for the result. When they saw it, however, they were shocked: according to their program, the most frequent word in the Russian language was the verb “какать” (to defecate), followed by the noun “кака” (baby language for “bad thing”, which is most frequently the result of какать). A quick analysis showed that the program had mistaken the words “какой” and “какая” (pronouns meaning “which”) for forms of these much less frequently used words.

In the end, the guys had to implement a full-fledged part-of-speech analyzer, with syntactic models, statistical schemes and the like.
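I do not know how that program actually worked, but here is one way such a collision can happen. The sketch below strips endings from a deliberately naive, made-up suffix list without looking at the part of speech, and all four words from the story collapse to the same stem.

```python
# Hypothetical, deliberately naive suffix list; real Russian morphology
# is far richer than this. Longest suffixes are tried first.
SUFFIXES = sorted(["ать", "ая", "ой", "а"], key=len, reverse=True)


def naive_stem(word):
    """Strip the longest matching suffix, ignoring part of speech."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)]
    return word


words = ["какой", "какая", "какать", "кака"]
stems = {word: naive_stem(word) for word in words}
print(stems)
```

Once everything shares one stem, the very frequent pronoun forms get counted towards whichever dictionary entry the program happens to pick, which is why context and part of speech have to be taken into account.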

steve · May 15, 2014, 9:01pm

Similar problems exist in English. act, active, activate, reaction, activity, activist, … one word family?

ColinJohnstonov · May 15, 2014, 9:07pm

‘act’

How many words is this string of letters?

Don’t act childish!
Act three was the best.
It’s just an act.
I act in a play.
It was a heroic act.
An act of parliament.

kcb · May 15, 2014, 10:04pm

Surely activate, activates, activated, activating COULD be counted as a word family. As for the others, who knows…I’m not saying it would be easy nor that everyone would agree on what counts as a word family.

And yet, some groups of words in some languages seem straightforward enough. How about every regular (in every tense and mood) verb in Spanish? Some of those words will have other meanings, but I for one don’t care so much. Is it so hard to distinguish between a past participle and an adjective in context and guess at the new meaning? Between a first person verb and a noun? Of course such ambiguity wouldn’t be for everyone.

If you didn’t count a word as known until you knew every meaning, I wonder how many words in English I’d ‘know’. Surely Colin’s list isn’t exhaustive…

Ress · May 15, 2014, 10:25pm

I was frustrated reading texts in Polish. The texts were almost completely blue. With 28k words it looks much better now - there are just some blue spots somewhere.
But a lot of English texts have only a few dozen blue words. I have just 7k known words in English.

ColinJohnstonov · May 15, 2014, 10:30pm

I have about 1300 words as known or LingQed in Chinese, and most texts are around 30% new or less. I have maybe 11,000 known or LingQed in Russian and regularly find texts with more than 50% new words.