Opinion on LingQ's word counting system

ian_kury · February 27, 2015, 10:23am

I think LingQ shouldn’t count different forms of the same word.
I believe that seeing our “known words” count grow fast can be good for motivation, but I still think it’s more important that this number accurately reflects the actual number of known words!
It might be argued that understanding that a certain word is just a different form of another word is already proof that you have some knowledge beyond that of the original word, and thus would deserve to have your count increase.
However, differences between languages can make this system disproportionate.
For example, in a language that has cases, there are many more different forms of words.
Even without cases, there are still conjugated verbs, different forms of pronouns, plural, etc.
I’m using LingQ to learn Russian. My word count now is almost 4000, but I’m aware that a big part of these words consists of just different forms of the same words.
I think it’s not fair to have the following situation:
You learn 100 nouns in Russian, for example, and learn some declension rules. Then these 100 nouns could count as 1000 words, or maybe 5000…
Thus, it’s really hard to get an accurate idea of one’s true knowledge.
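
To make the gap concrete, here is a minimal sketch of counting distinct forms versus distinct dictionary forms. The lookup table `FORM_TO_LEMMA` and the function name are hypothetical and hand-made for illustration; nothing here reflects how LingQ actually counts, and a real system would need a proper morphological analyser rather than a table.

```python
# Hypothetical hand-made table: surface form -> dictionary form (lemma).
# It covers only a few case forms of two Russian nouns, for illustration.
FORM_TO_LEMMA = {
    "книга": "книга",
    "книги": "книга",
    "книгу": "книга",
    "книге": "книга",
    "книгой": "книга",
    "стол": "стол",
    "стола": "стол",
    "столу": "стол",
    "столом": "стол",
    "столе": "стол",
}


def count_forms_and_lemmas(known_forms):
    """Return (number of distinct forms, number of distinct lemmas)."""
    forms = set(known_forms)
    # Forms missing from the table are assumed to be their own lemma.
    lemmas = {FORM_TO_LEMMA.get(form, form) for form in forms}
    return len(forms), len(lemmas)


known = ["книга", "книги", "книгу", "стол", "стола", "столом", "столе"]
form_count, lemma_count = count_forms_and_lemmas(known)
print(form_count, lemma_count)
```

With this toy input, seven "known words" collapse to two nouns, which is the kind of inflation described above.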

Yutaka · February 27, 2015, 12:30pm

I agree with your opinion, but the problem lies in the fact that the system is not smart enough to analyze words grammatically.
For example, it cannot differentiate between “lies” as a verb and “lies” as a plural noun.

Steve555 · February 27, 2015, 12:59pm

I agree, and this is a tough task computationally. As you said, there are big differences in this respect amongst different languages, and it would require a different algorithm for each language. But I share your goal in getting a better sense of word count. Once I know a verb, I usually mark the base form as level 4 (or Known) and then as I come across variants I mark them as ‘Ignore’ (X). This goes a long way to reducing redundancy.

From what you say, Russian seems particularly affected by this. In Swedish, where verbs are simpler, I find calculating 40% of the Known Words total gives a pretty good estimate of the number of unique words. (FWIW, if a word has a noun, verb and adjective form, I prefer to consider them three unique words, but just personal preference)
If you could find out approximately what percentage of words are nouns or verbs in Russian, and the average number of forms for words in these categories, you could estimate a percentage.
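
A rough sketch of that estimate, under stated assumptions: each category's share is taken to be its share of the counted forms (not of dictionary words), and the average is the number of forms you have actually marked as known per dictionary word. The function name is hypothetical, and the shares and averages below are made-up placeholders, not real statistics for Russian.

```python
def estimate_unique_words(known_count, category_shares, avg_forms):
    """Estimate dictionary words from a count of known forms.

    category_shares: fraction of the counted forms in each category.
    avg_forms: average number of counted forms per dictionary word.
    """
    total = 0.0
    for category, share in category_shares.items():
        # Forms in this category, divided by forms per dictionary word.
        total += known_count * share / avg_forms[category]
    return round(total)


# Placeholder numbers for illustration only (assumptions, not data).
shares = {"nouns": 0.5, "verbs": 0.3, "other": 0.2}
forms_per_word = {"nouns": 4, "verbs": 6, "other": 1}

print(estimate_unique_words(4000, shares, forms_per_word))
```

The single-ratio version (such as the 40% figure for Swedish) is the special case with one category and an average of 2.5 forms per word.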

ian_kury · February 27, 2015, 1:17pm

Yeah, as a programmer, I’m aware of the huge effort that would be needed in order to develop good algorithms for this. Though some problems are hard to solve, like Yutaka’s “lies” and “lies”, the algorithm can be improved gradually, and even though it could never achieve perfection, it would be able to get nearer and nearer an accurate number, up to a reasonable approximation, I would say.
I appreciate your suggestion, and, yes, of course the estimate would be different from language to language; and so far I don’t even have an estimate for Russian~

ColinJohnstonov · February 27, 2015, 3:21pm

I think making the system analyse the text grammatically to determine which words are original and which are not would be too difficult, and not just from a programming point of view. Yutaka already gave a great example with the word ‘lies’. Programming the system to recognise this as two words would be very difficult, but this is in fact something of an easy example. We know already that this is two words. What about something like the word ‘have’ in these two sentences:

I have a banana
I have eaten a banana

Should ‘have’ be counted as two different words here or one word? Probably we would agree that it is two. Even more difficult might be ‘table’ in these sentences:

I built a table out of wood
I summarised all my data in a table

This is still a relatively easy example. How about the word ‘propose’?

I propose we eat sushi tonight, without wasabi of course
I will propose to my girlfriend

Now are these two different words or one? Have a look at the list of definitions of this word in the dictionary.

There are nine definitions given there, but I can’t really decide how many distinct meanings there actually are.

ColinJohnstonov · February 27, 2015, 3:26pm

I realise that ian_kury’s original suggestion was not actually that the system distinguish different meanings of the same string of letters. The original suggestion was that the system doesn’t count different conjugated forms of words as separate words in the word count. This might be much easier, though there are going to be some difficulties. For example, in German, the word ‘weiß’ could be the colour white or it could be a form of the verb ‘wissen’ (to know). Maybe though, the system doesn’t need to be so accurate here. Anyway, I am not worried by the high known word count.
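
One way to live with cases like ‘weiß’ is to let a form map to several candidate dictionary words and simply set the ambiguous ones aside. The sketch below assumes a hypothetical hand-made table `CANDIDATE_LEMMAS`; it is not how LingQ works, only an illustration of not needing to be perfectly accurate.

```python
# Hypothetical table: surface form -> set of possible dictionary words.
CANDIDATE_LEMMAS = {
    "weiß": {"weiß", "wissen"},  # the colour, or a form of 'wissen'
    "weißt": {"wissen"},
    "wusste": {"wissen"},
    "weiße": {"weiß"},
}


def split_by_ambiguity(known_forms):
    """Return (dictionary words we are sure of, ambiguous forms)."""
    certain = set()
    ambiguous = {}
    for form in known_forms:
        # Forms missing from the table are assumed to be their own lemma.
        candidates = CANDIDATE_LEMMAS.get(form, {form})
        if len(candidates) == 1:
            certain |= candidates
        else:
            ambiguous[form] = candidates
    return certain, ambiguous


certain, ambiguous = split_by_ambiguity(["weiß", "weißt", "weiße"])
print(len(certain), sorted(ambiguous))
```

Here the unambiguous forms already account for both dictionary words, so leaving ‘weiß’ undecided does not change the count.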

Ress · February 27, 2015, 7:13pm

Let’s live with the KISS principle: Keep It Simple Stupid

ColinJohnstonov · February 27, 2015, 8:53pm

I prefer the SMOOCH principle: So Many Overcomplications Oughtn’t Commonly Happen!

pauler · February 28, 2015, 9:06am

If the known word counter system on LingQ is the most buggy and stupid I’ve ever seen (which it definitely is!), then, yes, I do agree with this principle! The only problem being, it can’t be any MORE stupid than it already is!

pmilone · February 28, 2015, 11:28am

A word is a word! It doesn’t matter to me if it’s another form of a word. Either you know it or you need to learn it. This is especially true with irregular verb conjugations. I suppose if you are really concerned about this, you can just ignore any words you think shouldn’t be included in your word count.

pauler · February 28, 2015, 11:41am

Hi ! =)))

This is exactly the reason I have a count of ZERO words in ALL languages!

ColinJohnstonov · February 28, 2015, 2:04pm

Well, let’s not exaggerate.

Yutaka · February 28, 2015, 2:10pm

I just say “Don’t COUNT on it!”

juliohart · March 1, 2015, 12:07am

As the Chinese say, 1001 words is worth more than a picture. - jmc

Yutaka · March 1, 2015, 1:18am

Are you referring to this?
“A picture is worth a thousand words”

Are you supposing that 1,001 is larger than 1,000 by one?