Solution - how to remove furigana (ruby characters) from a Japanese text in a very simple way!

For anyone learning Japanese or Chinese or Korean - these small characters, which normally help us read a text written in kanji, are a huge nuisance when importing to LingQ!
When I was importing texts and audio from Librivox (recommended!), I noticed that the texts (taken from the Aozora collection of works) are full of furigana, so once I had imported them it was impossible to use them in LingQ… :frowning: Even though the audio was excellent and the text was all there, in front of my eyes - it was unusable… For a second I even thought I could erase the furigana manually. :smiley: :smiley: :smiley: But no, don’t try, it’s too much. :slight_smile:
So I went googling and found out that these characters are called “ruby characters” and that they have special markup in HTML. When I searched for how to remove them, I was quite discouraged by the heavy programmers’ stuff: advice on how to get rid of these characters, aimed at Python programmers (well, it was complicated for me - but maybe it’s easy for you - here - Digital Japanese Literature: Aozora Bunko – Arts and Humanities Research Computing)
So I had to look for another solution, one for mere mortals. :smiley: :smiley: :smiley: I thought that if I know how these characters are marked, I can get rid of them in a simpler fashion… well, using just MS Word! :slight_smile:
And that’s how you do it:
a) go to the Aozora text (or any other online text you need that has furigana)
b) while in the browser, press Ctrl+U (for Chrome; for other browsers - see here How to View the HTML Source Code of a Web Page) and…
c) you’ll see the text with all the tags! (or whatever they are called - see the print screens attached - #1 and #2)
d) copy this text
e) paste it to MsWord
f) in Word go to “find and replace” (see print screen #3)
g) fill in the “Find what” field with the stuff you want to get rid of* and leave the “Replace with” field empty; then hit the “Replace all” button - and see the magic happen!
(*delete things one by one! - first get rid of “rb”, then “rt”, then “rp”, then “ruby”, then “<”, then “>”, then the leftover “/” etc… IMPORTANT! Don’t touch the round brackets!)
h) now, the real magic! You didn’t touch the round brackets, right?

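  • find and replace all left round brackets with “a”
  • find and replace all the right round brackets with “b” (if your text itself contains the Latin letters “a” or “b”, pick two other characters that don’t occur in it)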
  • in the “find and replace” window tick the “use wildcards” box
  • in the “find what” field write “a*b”, in the “replace with” - nothing, as before
  • press the “replace all”…
    CONGRATULATIONS!
    Now - copy and paste the clean text to LingQ :slight_smile: :slight_smile: :slight_smile:
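
For anyone who does want a scripted route, the Word steps above can be roughly sketched in Python. This is only a minimal sketch under my own assumptions, not the method from the linked article: the function names `strip_ruby_markup` and `strip_bracketed_readings` and the sample strings are made up for illustration, and it assumes the page source uses plain `<rt>` and `<rp>` tags without attributes.

```python
import re


def strip_ruby_markup(html: str) -> str:
    """Remove furigana from HTML source that uses <ruby> markup."""
    # Drop <rp>...</rp> (the brackets) and <rt>...</rt> (the readings),
    # including whatever is inside them.
    text = re.sub(r"<rp>.*?</rp>|<rt>.*?</rt>", "", html,
                  flags=re.DOTALL | re.IGNORECASE)
    # Drop every remaining tag (<ruby>, <rb>, ...) but keep the text inside.
    return re.sub(r"<[^>]+>", "", text)


def strip_bracketed_readings(text: str) -> str:
    """Remove readings that are already plain text inside round brackets.

    This mirrors the "a*b" wildcard step: anything between a left and a
    right round bracket (full-width or ASCII) is deleted, brackets included.
    """
    return re.sub(r"[（(][^（()）]*[）)]", "", text)


# Hypothetical sample input, written in the style of ruby markup.
sample_html = (
    "<ruby><rb>漢字</rb><rp>（</rp><rt>かんじ</rt><rp>）</rp></ruby>を"
    "<ruby><rb>読</rb><rp>（</rp><rt>よ</rt><rp>）</rp></ruby>む"
)
print(strip_ruby_markup(sample_html))
print(strip_bracketed_readings("漢字（かんじ）を読（よ）む"))
```

Just like the Word trick, the second function also deletes any genuine remark in round brackets, so it is worth skimming the result before importing it.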

#1

#2

#3

#4
Don’t delete the round brackets!
Substitute the left round bracket with “a” and the right one with “b”!

#5






I found a mistake in my description: press Ctrl+U to view the HTML source (not Ctrl+Q, as I mistakenly wrote).
Now, mistake corrected! And I also added a link about how to do it for browsers other than Chrome. :slight_smile:

Alisa, this is great. Could you make these lessons public for everyone to view? (The copyright on these has more than likely expired, so it’s OK to share them publicly.)

I want to have a section in LingQ for Japanese literature for other readers to check out! It would be awesome if you could share your lessons, please let me know!


Hello, Eric! Yes, it seems that LibriVox is public domain, and I did intend to share, of course! :wink: It’s just that I’m still working on it - I’ve added 6 stories from the 4th collection and there are 6 more to add, and it takes some time :slight_smile: Now that I know someone is waiting, I’ll try to add them faster. I’ll let you know as soon as it’s ready, for sure!

Thanks so much, we’re happy to hear that.

All 12 ready! Finally, hooray!
Here - Login - LingQ
But it’s just one collection out of ten that are available there (LibriVox), and there are also solo stories…

Excellent work, Alisa! ありがとうございます。I’m currently adding RSS feeds for various languages for users to import.


This is great! Thanks!
