Lots of new blue words in Japanese?

I’ve noticed that a lot of words I had already marked as known are suddenly coming back as blue words, combined with other words. It’s strange. What is the cause of this? For instance, these 4 separate words: 1) 海賊 2) たち 3) の 4) 歓声
are now back as a single blue word: 海賊たちの歓声. My current lesson is full of such examples. I know this can happen from time to time, but I was surprised by how many I suddenly encountered in one lesson, and wanted to check whether this is due to a change in the algorithm, or whether someone (an administrator?) arbitrarily decided to mark this particular sequence of words as a word. I ask because I don’t know how blue words are created in the first place. I’m also a little concerned that it might add a lot of junk to get rid of while reading.
The lesson I’m referring to was created some weeks ago.

Other very basic examples:

オレ & が => オレが

これ & が => これが

Anyone who’s learning Japanese knows that the particle is usually not attached to the word. In many cases, like オレ (ore) and これ (kore), the words appear so often that they must have shown up unattached in the past, which is how I was able to save them separately.

It’s not that important, but I’d be curious to know why this is happening now.

Hi! Here is the long and detailed answer: Improvements To Chinese And Japanese - Language Forum @ L...

The short version is that Japanese and Chinese have been updated with an improved algorithm, which means some words are no longer split the way they used to be. Hence these new blue words :slight_smile:

I’m wondering whether attaching the particles like in those examples is a good thing or not. Won’t that create redundant entries?

オレが
オレに
オレは
オレで
etc.

OK, maybe some of those entries are impossible, but you understand what I mean.
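To make the concern concrete, here is a rough sketch of how the number of entries grows when particles are attached to the word. The word and particle lists are only illustrative, and the two functions are hypothetical; nothing here reflects how LingQ actually stores its entries.

```python
from itertools import product

# Illustrative data only: a few words and particles from this thread.
nouns = ["オレ", "これ", "海賊"]
particles = ["が", "に", "は", "で"]


def entries_attached(nouns, particles):
    """Entries needed if every word+particle pair is saved as one item."""
    return {noun + particle for noun, particle in product(nouns, particles)}


def entries_separate(nouns, particles):
    """Entries needed if words and particles are saved on their own."""
    return set(nouns) | set(particles)


attached = entries_attached(nouns, particles)
separate = entries_separate(nouns, particles)

print(len(attached))  # 12 entries: one per word+particle combination
print(len(separate))  # 7 entries: 3 words + 4 particles
```

With attached particles the count grows with the product of the two lists, while with separate entries it grows with their sum.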

Besides the problem of having to get rid of all those redundant entries, it also messes up the stats, because if we don’t pay attention we end up marking thousands of new words as known when in fact they are not new words. I don’t want to sound too precious about the stats and the word count, but I rely on them to measure my progress. Suddenly, Advanced 2 should not be at 22k words, but maybe 44k words.

Take a look at this lesson. Scan all the blue words. I started at around 330, I believe. All I do is delete them. Take for instance 海賊なんかに. This “word” is no good to anyone, since just about any Japanese noun can be followed by なんか or なんかに. So if I would rather learn “海賊” by itself, I can’t. I have to wait until it appears alone to mark it, but that never happens. And if I try to select just that word to create an entry for it, the JavaScript won’t let me; I have to highlight the whole sequence (why?). And even if I could, and eventually marked it as known once I knew it, I would keep seeing it in blue all the time, because it would appear inside a sequence that I haven’t yet marked as known, and so on. There’s a loophole in this. Once I know 海賊, I’d like as much as possible not to see it in blue again.
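The loophole can be sketched as follows, assuming (and this is only an assumption, the thread does not describe LingQ’s internals) that a token is shown in blue whenever the exact token string is not among the user’s saved words. The function name `blue_tokens` is hypothetical.

```python
def blue_tokens(tokens, known):
    """Return the tokens that would be shown as new (blue).

    Assumption: a token counts as known only on an exact match,
    so knowing its parts does not help.
    """
    return [token for token in tokens if token not in known]


# Words the user has already marked as known, one by one.
known = {"海賊", "たち", "の", "歓声"}

old_split = ["海賊", "たち", "の", "歓声"]  # the four separate words
new_split = ["海賊たちの歓声"]              # the combined blue word

print(blue_tokens(old_split, known))  # [] : nothing new
print(blue_tokens(new_split, known))  # ['海賊たちの歓声'] : blue again
```

Under that assumption, every new combination containing 海賊 shows up as blue no matter how many times 海賊 itself has been marked as known.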

Could you offer the option of using the old algorithm? It could be a setting. I find this change more and more frustrating as I go through my lessons.

So, am I the only one who finds that the new word separation in Japanese is a problem because it creates too many redundant words? I’d like to hear other opinions. If I’m the only one who thinks so, I will just let it go, but I’d really like to get some feedback, especially on the examples I provided.

I’m having difficulty breaking the suggested blue words/phrases into smaller words. I usually use Google Translate for phrases I’m not familiar with. The way LingQ currently suggests new words seems very illogical to me. I used to be able to highlight together any sequence of words I considered important. For example, here is a sentence: 私は今アメリカの大学で日本語を勉強しています。Currently LingQ suggests LingQing it this way: (私は今アメリカ、の大学で、日本語を勉強、しています。) I would rather LingQ it like this: (私は、今、アメリカの、大学で、日本語を、勉強しています。) Does that make sense? Is anyone else having the same issue?

I’m really sorry for the delay in getting back to you on this! For some reason we didn’t notice this thread until now, so I want to apologize for that.

Yes, as Pizzalover noted there were some updates to the algorithm. The pro to this is that hopefully more words are split accurately. The con, however, is that some words will be split differently, and some things which were “better” with the old algorithm will be “worse” with the new one. Certainly it would be nice to pick and choose parts from the different algorithms, but this unfortunately isn’t a possibility at this time.

We did consider the idea of keeping the old algorithm around, though after a lot of deliberation we decided against it. It would have created a lot more complexity, since we would have to maintain two versions of each lesson to display to different users. And eventually, if another, better algorithm came along, we would either have to scrap both and add the new one, or continue on the same path and support even more versions.

Hopefully this helps explain things a little more clearly!

Hi! Good news here - while the algorithm does have its preferred way of splitting text based on context, etc. it is possible to override this in lessons.

I’d be happy to give you editor access if you would like to manually improve the splitting in different lessons you come across. Let me know if you’d be interested in this!

For me, it breaks “words” as big chunks between commas and sentences.

Thanks for the explanation, Alex. I don’t mind that the words are split differently; it’s just that I’d rather have each word kept as a separate LingQ instead of combining them endlessly and creating too many redundant LingQs. Anyway, we’ll see how it goes in the future.

Thanks, Alex, for your reply. I’m very much interested in your offer. Would you please show me how?

Agree, this needs more looking at. Alex?

Sure thing! I’ve now granted you editor access for Japanese, so you should see an “Edit Lesson” link appear in all Japanese lessons when clicking the gear icon at the top.

When you edit a lesson, you’ll see the words split a specific way. To override this, simply adjust the spacing on this page and save the lesson. The parsing algorithm only runs once - when the lesson is first uploaded - so any subsequent edits will be preserved on the lesson page.
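As a rough illustration of the behavior described above, here is a minimal sketch in which the automatic splitter runs only at upload, and a later edit replaces the stored split with whatever the spacing in the edited text says. The `Lesson` class, `auto_split`, and `save_edit` are hypothetical names, and `auto_split` is a trivial stand-in, not the real parsing algorithm.

```python
from dataclasses import dataclass, field


def auto_split(text):
    """Stand-in for the real splitter: keeps the text as one chunk."""
    return [text]


@dataclass
class Lesson:
    text: str
    tokens: list = field(default_factory=list)

    @classmethod
    def upload(cls, text):
        # The automatic splitter runs once, when the lesson is created.
        return cls(text=text, tokens=auto_split(text))

    def save_edit(self, spaced_text):
        # On edit, the spacing typed by the editor defines the tokens;
        # the automatic splitter is not run again.
        self.tokens = spaced_text.split()


lesson = Lesson.upload("オレが")
print(lesson.tokens)  # ['オレが']

lesson.save_edit("オレ が")
print(lesson.tokens)  # ['オレ', 'が']
```

Because the stored tokens belong to one lesson, an edit like this would have to be repeated in every lesson where the same sequence appears.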

If you have any questions in the meantime please let us know!

Can it be used in personal imports as well?

To get rid of entries like these:

オレが => オレ が
オレに => オレ に
オレは => オレ は
オレで => オレ で

How does it work? Do we have to make the split for each individual entry?

Would you give me an example of a lesson where it’s not being split correctly?

Yup, you can do this for your own personal imports too, though it applies only to each specific instance. We had considered doing this for all lessons, but updating the splitter to accommodate this was quite tricky, so we opted for a simpler solution in the meantime.

OK, so it won’t help, because it means that in each import I’d have to adjust the split for about 200 words, which would take me about an hour, and I would have to do it all over again in the next import.

Yes, unfortunately it won’t be possible to revert to the previous way that the lessons were split.

What I might recommend as a temporary solution is to use the “ignore” feature liberally. You can simply ignore words that you have, for example, saved before, and don’t want to save again. I’m sorry I don’t have a better answer here!