Japanese Language Parsing/Spacing Problems

iheartnihon · November 15, 2016, 6:42pm

When importing a lesson in Japanese, the import tool attempts to parse the words. The issue is it doesn’t recognize longer phrases or grammar constructs, but instead chops them up into smaller words it thinks it recognizes within them.

For example, the phrase “その日” (pronounced sonohi) is generally in the dictionary, including my popup dictionary Rikaichan, but LingQ instead parses it as two words, “その” and “日”. The first problem is that it should recognize the word correctly, and I’m stuck seeing two words where there should be one. The second problem is that even if I try to make the entire phrase a LingQ, it includes the space between the words (which should NEVER be there). That space also causes the text-to-speech to pronounce it wrong: if it WERE two different words, they would be pronounced “sono nichi”, which is what it says. If I modify the LingQ and remove the space in the “term” so that it is just その日, the text-to-speech correctly says “sonohi”, but then the LingQ is no longer recognized in the article.
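To illustrate the matching half of this problem, here is a minimal Python sketch of how a term saved without spaces could still be found in text that has already been split into tokens. This is only an assumption about one way to do it, not how LingQ actually works; `find_term_spans` and the sample sentence are made up for the example.

```python
def find_term_spans(tokens, term):
    """Return (start, end) token index pairs whose joined text equals term.

    Hypothetical helper: joins consecutive tokens without spaces and
    compares the result against the space-free term.
    """
    spans = []
    for start in range(len(tokens)):
        joined = ""
        for end in range(start, len(tokens)):
            joined += tokens[end]
            if joined == term:
                spans.append((start, end + 1))
                break
            if not term.startswith(joined):
                # The joined text can no longer grow into the term.
                break
    return spans


# Text as it might look after an import step has inserted spaces.
tokens = "その 日 は 雨 だった".split(" ")
print(find_term_spans(tokens, "その日"))  # [(0, 2)]
```

With something like this, the stored term could stay space-free (so text-to-speech reads it as one word) while still being highlighted across the two tokens in the lesson.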

It gets even more ridiculous when grammar is involved, as conjugated verbs like 支配されていた get parsed as 4 separate words! In fact, this is the passive progressive form of 支配する, which means “to dominate”, so its meaning is “was dominated” (or literally “was being dominated”), but LingQ instead falsely recognizes it as the 4 words “支配” “さ” “れて” “いた”. While those 4 tokens can be words on their own, they are not separate words in this context.
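A rough Python sketch of the idea of mapping a conjugated form back to its dictionary form by rewriting the ending. The three rules below are a toy table written just for this example; a real tool would need a far larger rule set, and none of this is taken from LingQ or Rikaichan code.

```python
# Toy ending-rewrite rules (assumed for illustration only).
TOY_RULES = [
    ("されていた", "する"),  # passive + progressive, past
    ("されている", "する"),  # passive + progressive, non-past
    ("された", "する"),      # passive, past
]


def deinflect(surface, rules=TOY_RULES):
    """Return candidate dictionary forms for a conjugated surface form."""
    candidates = []
    for ending, replacement in rules:
        if surface.endswith(ending):
            candidates.append(surface[: -len(ending)] + replacement)
    return candidates


print(deinflect("支配されていた"))  # ['支配する']
```

The candidates would then be checked against a dictionary, so the whole conjugated form can be treated as one word instead of four.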

I realize I can hit ‘Ignore’ on everything I don’t like, but it’s an annoying workaround. It makes for a bad user experience, and it’s why 200+ people (thread view count) on another popular Japanese learning site were not happy with using LingQ for learning Japanese.

Is this issue even on LingQ’s radar to fix?

usablefiber · November 16, 2016, 12:27pm

Yes, the spacing issue for Japanese is brutal on LingQ. Mark has said he wants to fix the issues, but I wouldn’t hold your breath, simply because they don’t have the manpower at this point. My best strategy is to use the Rikaikun popup dictionary and create LingQs for the words the popup can give you. I agree that the way the spacing splits up grammar, endings, and particles is ridiculous.

iheartnihon · November 16, 2016, 6:16pm

I wonder if they’d ever consider hiring freelance… my boyfriend says it would be a relatively simple fix! Rikaichan already does it, so it’s not like nobody has figured it out. I bet they’d get a ton more paid users if they did.

usablefiber · November 16, 2016, 7:24pm

Mark, other LingQ people, have you guys considered this? Us Japanese learners would really, really appreciate it!

mark · November 16, 2016, 10:15pm

If it were only that simple! Freelancers still need to get paid and need to be managed and have the project explained and need to understand the issues and then try to figure out how Rikaichan identifies word splits and then try to implement that within our functionality. All of which needs to be maintained on an ongoing basis. It is actually a complex problem that we have already spent a significant amount of time on in the past.

We know we can eventually do it better but just have to find a window. We are going to be investing in improving our lesson editing and managing interface soon and maybe we will find time to look at Asian languages again at that point.

iheartnihon · November 16, 2016, 10:34pm

Can you elaborate on what you guys have tried and why it failed? Rikaichan is open source, so you can see exactly what they did. I’m not sure why it would be difficult to implement what they did in the import tool. Removing the spaces is one issue, but even just parsing better with the spaces still in there would make a huge difference to a Japanese learner. You guys are missing out on a huge chunk of money from the online Japanese learning community, most of whom head to paid tools such as Wanikani for SRS, when they could be getting twice the benefit from your tool (as it combines reading and SRS).

mark · November 17, 2016, 12:17am

Honestly, we haven’t looked at this issue in some time but, as I recall, Rikaichan looks at the text in small sections when you identify a word, whereas we needed to parse the whole text upon upload. I know we were aware of Rikaichan when we were looking into this. I just can’t remember all the issues we encountered. I’m sure there is more we can do there to improve it. It’s always just a function of time and resources. But there always end up being other things we need to work on first.

We will get to it but it’s a question of when. We do still have many users happily studying Japanese on LingQ the way it is. Yes, we’d like to make it better but it is still quite good as it is. I have used it myself quite a bit and don’t find the splits to be too much of an issue. Of course, it would be better if we could split better. In fact, at the time we were working on this splitting issue, Japanese was the language I was studying and I was quite insistent on the fact that we should be able to do what Rikaichan does but at a certain point we had to give up and move on.

iheartnihon · November 17, 2016, 2:01am

Ah anyhow, my boyfriend is the programmer out of the two of us (I am just the brains behind his Japanese apps haha) and we were talking about it and thinking about a few things. It seems that on import you have to look at every character anyhow, so it shouldn’t be much slower to look at 10 characters at a time and look for the longest match you can find. That’s one possible improvement, if it isn’t too slow… what tweaking it needs, IDK. Another possibility, if there is no technical reason you have to parse on import, is to parse for words on each page of the lesson instead: start by not adding spaces on import, and just parse to identify LingQs at page load. That way you only have to look at one page at a time. Then it’s a trade-off between doing it once and doing it for each page. It seems simple from the outside, but the question is whether it is too slow. Tons of things to consider. I hope this gets fixed one day… however many people you have studying Japanese on the site now, it can’t be many compared with the thousands you could reach on the message boards of Wanikani/Kanji Koohii/JREF who have already voiced frustration over LingQ not being sufficient for their Japanese studies.
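Here is a small Python sketch of the longest-match idea, just to make it concrete. It assumes a plain set of known words as the dictionary and a 10-character window; the function name and the toy dictionary are invented for the example, and greedy matching like this will still make mistakes that a proper morphological analyzer would avoid.

```python
def segment_longest_match(text, dictionary, max_len=10):
    """Greedy segmentation: at each position take the longest dictionary hit.

    Falls back to a single character when nothing in the window matches.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until something matches.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += length
                break
    return tokens


# Toy dictionary (assumed); a real one would hold many thousands of entries.
TOY_DICTIONARY = {"その", "日", "その日", "は", "雨", "だった"}
print(segment_longest_match("その日は雨だった", TOY_DICTIONARY))
# ['その日', 'は', '雨', 'だった']
```

Because “その日” is in the dictionary and is longer than “その”, it wins. The same function could be run on one page of text at page load instead of on the whole lesson at import.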

mark · November 17, 2016, 5:56am

Yes, those are all possible solutions but they still require us to spend the time. Thanks for the helpful suggestions. We will get to this as soon as we can.

mark · August 22, 2017, 3:17pm

Just so you know, all Premium members now have the ability to edit all lessons to adjust the word splits. We are hoping this will improve the splitting issue.