Solution - how to remove furigana (ruby characters) from a Japanese text in a very simple way!

_Alisa_ il Israel

For anyone learning Japanese, Chinese or Korean - these small characters, which normally help us read a text written in Chinese characters (kanji), are a huge nuisance when importing to LingQ!

When I was importing texts and audio from Librivox (recommended!), I noticed that the texts (taken from Aozora works collection) are full of furigana, so when I imported it was impossible to use them in LingQ... :( Even though the audio was excellent and the text was all there, in front of my eyes - it was unusable... For a second I even thought I could erase them manually. :D :D :D But no, don't try, it's too much. :)

So I went googling and found out that these characters are called "ruby characters" and that they have special markup in HTML. When I searched for how to remove them, I was quite discouraged by the heavy programmers' stuff: advice on how to get rid of these characters for Python programmers (well, it was complicated for me, but maybe it's easy for you - here - http://darthcrimson.org/digital-japanese-literature-aozora-bunko/)

However, I had to look for another solution, for simple mortals. :D :D :D So I thought that if I know how these characters are marked, I can get rid of them in a simpler fashion... well, using just MsWord! :)

And here's how you do it:

a) go to the Aozora text (or any other online text you need that has furigana)

b) while in browser, press Ctrl+U (for Chrome, for other browsers - see here https://www.computerhope.com/issues/ch000746.htm) and...

c) you'll see the text with all the tags! (or whatever they are called - see the print screens attached - #1 and #2)

d) copy this text

e) paste it to MsWord

f) in Word go to "find and replace" (see print screen #3)

g) fill the "Find what" field with the stuff you want to get rid of* and leave the "Replace with" field empty; then hit the "Replace all" button - and see the magic happen!

(*delete things one by one! - first get rid of "rb", then "rt", then "rp", then "ruby", then "<", then ">" etc... IMPORTANT! Don't touch the round brackets!)

h) now, the real magic! You didn't touch the round brackets, right?

- find and replace all left round brackets with "a"

- find and replace all the right round brackets with "b"

- in the "find and replace" window tick the "use wildcards" box

- in the "find what" field write "a*b", in the "replace with" - nothing, as before

- press the "replace all"...

CONGRATULATIONS!

Now - copy and paste the clean text to LingQ :) :) :)
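For those who don't mind a few lines of Python after all, the same idea can be sketched in code. This is only a minimal sketch, not the method from the link above: it assumes the page marks furigana the way the steps above describe (tags named "ruby", "rb", "rt" and "rp", with the reading and its round brackets inside "rt" and "rp"). The function name strip_furigana and the sample sentence are made up for illustration.

```python
import re

# Hand-written sample in the style of the markup described above
# (an assumption for illustration, not copied from a real page).
SAMPLE = (
    "<ruby><rb>吾輩</rb><rp>（</rp><rt>わがはい</rt><rp>）</rp></ruby>は"
    "<ruby><rb>猫</rb><rp>（</rp><rt>ねこ</rt><rp>）</rp></ruby>である。"
)


def strip_furigana(html: str) -> str:
    """Return the text with ruby readings and all remaining tags removed."""
    # Remove the readings (<rt>...</rt>) and the bracket fallbacks
    # (<rp>...</rp>) together with everything inside them.
    text = re.sub(
        r"<r[tp]\b[^>]*>.*?</r[tp]>",
        "",
        html,
        flags=re.DOTALL | re.IGNORECASE,
    )
    # Remove every remaining tag but keep the text between the tags.
    return re.sub(r"<[^>]+>", "", text)


print(strip_furigana(SAMPLE))
```

Running it on the sample prints 吾輩は猫である。 with the readings gone. Unlike the "a*b" wildcard trick, this does not touch any Latin "a" or "b" letters that may appear in the text itself.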

#1

image

#2

image

#3

image

#4

image

Don't delete the round brackets!

Substitute the left round bracket with "a" and the right one with "b"!

#5

image

March 14 at 21:18
  • _Alisa_ il Israel

    I found a mistake in my description, press Ctrl+U to view the HTML source (and not Ctrl+Q as I mistakenly wrote).

    Now, mistake corrected! And I also added a link about how to do it for browsers other than Chrome. :)

    March 16 at 08:36
    • Administrator
      ericrobertz ca Canada

      Alisa, this is great. Could you make these lessons public for everyone to view? (The copyright on these has more than likely expired, so it's OK to share them publicly.)

      I want to have a section in LingQ for Japanese literature for other readers to check out! It would be awesome if you could share your lessons, please let me know!

      March 18 at 20:30
      • _Alisa_ il Israel

        Hello, Eric! Yes, it seems that LibriVox is public domain, and I did intend to share, of course! ;) It's just that I'm still working on it - I've added 6 stories from the 4th collection and there are 6 more to add, and it takes some time :) Now that I know someone is waiting, I'll try to add them faster. I'll let you know as soon as it's ready, for sure!

        March 19 at 06:41