Hyphenation in imported ebook, how to fix

WileEQuixote · August 2, 2021, 10:20pm

This long question is outside the scope of the LingQ product itself. It’s a kind of for-extra-credit question.

I used the marvelous, wonderful LingQ feature to import a PDF (the four Gospels, in Russian). LingQ happily imported the text, divided it into lessons, etc. This is so cool!

It seems the original document text is hyphenated, so the imported text has many instances of "- " (hyphen followed by a whitespace character) – 100 or so in the first lesson.

I just manually fixed lesson 1 in the browser within LingQ. It took maybe 10 minutes. I used ctrl-F to find the instances and manually deleted each instance of the two characters.

Anyhow, does anyone know any way I could automate this process, either inside LingQ or outside? Years ago I would use text editors that supported regular expressions to do stuff like this. But I don’t know how to obtain the raw PDF text that LingQ imports. Yes, I could import the text into LingQ, copy and paste each lesson’s text into a text file, edit each text file with a text editor, and import each edited text file into LingQ to create a new, hyphen-free lesson. It would be easier to manually edit them in place inside LingQ.
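
For text that has already been flattened into one run (so the break shows up as "- " rather than a hyphen at the end of a line), a regular expression can do what the manual ctrl-F pass did. The following is only a sketch, assuming Python 3; the function name `remove_import_hyphens` is made up for illustration. It only removes a hyphen plus whitespace when there is a letter on both sides, so a spaced dash like "он - она" is left alone. It cannot tell a genuinely hyphenated compound that happened to break at a line end from an ordinary split word, so those would lose their hyphen too.

```python
import re

# A hyphen followed by whitespace, but only when it sits between two word
# characters. In Python 3, \w matches Cyrillic letters as well as Latin ones.
BROKEN_WORD = re.compile(r"(?<=\w)-\s+(?=\w)")


def remove_import_hyphens(text: str) -> str:
    """Rejoin words that were split as 'first- second' during import."""
    return BROKEN_WORD.sub("", text)


if __name__ == "__main__":
    sample = "еван- гелие от Мат- фея"
    print(remove_import_hyphens(sample))
```

The cleaned text would still have to be pasted or imported back into LingQ by hand.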

zoran · August 3, 2021, 1:50am

Unfortunately, there is currently no way to remove all the hyphens directly in LingQ. But hopefully someone else will be able to give you good tips on how to do it before importing.

milanezi · August 3, 2021, 4:25pm

One automatic way to get the text out of a PDF is to run an OCR (optical character recognition) scan on it.

I used this site for books in Greek, but I see many more languages are supported, including English. All text ends up in one file which LingQ can parse and divide into lessons.

Be warned though, OCR is slow. I don’t remember how slow exactly, but I wouldn’t be surprised if it took up to 1 hour for 150 pages.

Another thing: while the site I recommended has no page limit, I once tried to scan 700 pages and it failed after a long wait.
I split the file into pieces of 150 pages each, had them scanned separately, and it worked. I use the site below for splitting.

khardy · August 4, 2021, 6:24pm

Warning: I live and work in Linux, which is an incomprehensible foreign land to many. Even if you grasp it and have the tools, the solution I offer here might be more manual than you want.

First, there is a utility named “pdftotext” that extracts the text from a PDF file. “Proper” PDFs made from text will almost always have the text available to this utility. But I’ve also found that even scanned PDFs from Google Books have the OCR’ed text embedded in the file, and that can be extracted by this utility.

As for removing the hyphens from the end of the line and joining the word halves, this ‘sed’ script seems to do the job:

s/-$/###/
/###$/N;s/\n//
s/###//g

That first changes the hyphens at the ends of the lines to “###”, to differentiate them from words that are hyphenated in the middle of a line that need to stay hyphenated. It then joins the line with the next one wherever “###” is found at the end of a line, and finally removes all instances of “###”. (Choose some other character combo that doesn’t exist in the text if “###” is already present and needs to remain.)
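
For anyone without sed, the same idea can be sketched in Python 3. This is not a translation of the script above, just the same technique under my own assumptions: the function name `join_hyphenated_lines` is invented, and the hyphen is taken to be the plain "-" character. It keeps joining while the combined line still ends in a hyphen, which, as far as I can tell, the three-line script would not do in a single pass when two hyphen-ended lines follow each other. It also strips trailing spaces before checking for the hyphen, and leading spaces from the continuation line.

```python
def join_hyphenated_lines(text: str) -> str:
    """Join a line ending in '-' with the line that follows it."""
    joined = []
    pending = None  # first part of a split word, waiting for its continuation
    for line in text.split("\n"):
        if pending is not None:
            line = pending + line.lstrip()
            pending = None
        stripped = line.rstrip()
        if stripped.endswith("-"):
            # Drop the hyphen and hold the line until the next one arrives.
            pending = stripped[:-1]
        else:
            joined.append(line)
    if pending is not None:
        # The text ended on a hyphen: nothing to join with, so put it back.
        joined.append(pending + "-")
    return "\n".join(joined)


if __name__ == "__main__":
    print(join_hyphenated_lines("еван-\nгелие от\nМат-\nфея"))
```

Hyphens in the middle of a line are never touched, so no placeholder such as “###” is needed here.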

pdftotext and the sed script could be wrapped in a single shell script that would read a PDF and emit de-hyphenated text.
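
A rough sketch of such a wrapper, written in Python 3 rather than shell, might look like the following. It assumes pdftotext is installed and on the PATH; passing "-" as the output file name asks pdftotext to write to standard output, and `-enc UTF-8` sets the output encoding. The function names and the file name in the usage note are placeholders, and the de-hyphenation here is a one-line regular expression instead of the sed script.

```python
import re
import subprocess


def pdftotext_command(pdf_path):
    """Build the pdftotext command line; '-' sends the text to stdout."""
    return ["pdftotext", "-enc", "UTF-8", pdf_path, "-"]


def dehyphenate(text):
    """Remove a hyphen that is immediately followed by a line break."""
    return re.sub(r"-\n", "", text)


def pdf_to_clean_text(pdf_path):
    """Extract the text of a PDF with pdftotext and rejoin split words."""
    result = subprocess.run(
        pdftotext_command(pdf_path), check=True, capture_output=True
    )
    return dehyphenate(result.stdout.decode("utf-8"))
```

Calling `print(pdf_to_clean_text("gospels.pdf"))` (a hypothetical file name) would then print the de-hyphenated text, which could be redirected to a file for import.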

An issue that might need to be addressed is that LingQ, I believe, like Word and others, expects a paragraph to be totally unbroken and takes a linefeed to indicate a new paragraph. If the linefeeds inside a paragraph are left in, LingQ might retain them and break up the paragraph unnaturally when formatting it for presentation. More sed magic, or perhaps a Python or Perl script, might be able to take care of this.
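
As an illustration of the Python route, here is a minimal sketch that puts each paragraph on a single line. It rests on an assumption that may not hold for every PDF: that the extracted text separates paragraphs with at least one blank line. The function name `unwrap_paragraphs` is made up. It should be run after the hyphens have been removed, since it joins lines with a space.

```python
import re


def unwrap_paragraphs(text: str) -> str:
    """Put each paragraph on one line, assuming blank lines separate them."""
    paragraphs = re.split(r"\n\s*\n", text.strip())
    # Collapse every run of whitespace inside a paragraph, including line
    # breaks, into a single space.
    return "\n".join(" ".join(p.split()) for p in paragraphs)


if __name__ == "__main__":
    wrapped = "В начале было\nСлово, и Слово\n\nбыло у Бога.\n"
    print(unwrap_paragraphs(wrapped))
```

If the extracted text has no blank lines between paragraphs, this would merge everything into one long line, so it is worth checking a page or two of output first.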

Edit: This example assumes the ASCII hyphen-minus character “-” (0x2D, encoded the same way in UTF-8) is used as the hyphen. Unicode also has a dedicated hyphen character (U+2010) that the PDF may use instead; it would have to be substituted in the sed script, assuming that sed is Unicode-compatible.
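
In Python 3 the different hyphen characters can be covered in one pattern. This sketch treats three characters as line-end hyphens: the ASCII hyphen-minus, the Unicode hyphen U+2010, and the soft hyphen U+00AD. Which of these a given PDF actually contains is something to check rather than assume, and the function name is hypothetical.

```python
import re

# Hyphen-minus, U+2010 HYPHEN, or U+00AD SOFT HYPHEN, directly before a
# line break.
LINE_END_HYPHEN = re.compile(r"[\-\u2010\u00ad]\n")


def strip_line_end_hyphens(text: str) -> str:
    """Rejoin words split across lines by any of the hyphens above."""
    return LINE_END_HYPHEN.sub("", text)


if __name__ == "__main__":
    print(strip_line_end_hyphens("Мат\u2010\nфея"))
```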

Keriamon · August 4, 2021, 9:05pm

I had the same problem. I converted my .pdf to Word and did a find and replace on the hyphens, but instead of just taking it out, it seems to have replaced it with a space. (Or maybe there was some code somewhere that did it; I’m not sure. The space shouldn’t have been there, though.)

I’m getting good at reading Polish words split in half. LOL.

Jokojoko83 · August 4, 2021, 9:40pm

Convert the PDF into a Word document with Calibre (free software)

and then use Word or OpenOffice to replace the hyphens with nothing.

For the specific case of reading the Bible in Russian without hyphens, you can find a more suitable version here:

Jokojoko83 · August 5, 2021, 8:53pm

WileEQuixote · August 9, 2021, 11:35am

Thank you all for the posts and solutions and links!