Converting PDF/Images into OCR Text to Import

FemmeApprendLeFrancais · March 17, 2020, 11:16pm

Hi all,

I have a bunch of newspapers, magazines, and articles I would like to import into LingQ, but they’re all in PDF format. Some of them also don’t have optical character recognition (OCR), so they contain no selectable text.

What are the best OCR programs for PDF versions of the press?
What software will convert these OCR’d PDFs into plain text files that I can import into LingQ?
I gladly welcome all advice/suggestions!

musicserver77 · March 24, 2020, 3:25am

I often use the Google Translate app: I take a picture, load it into the app, then copy the original text rather than the translation.

benscheelings · March 24, 2020, 1:28pm

What is the extension of the pictures you load into Google Translate, Musicserver?

herbm · March 24, 2020, 4:55pm

The following is a very tentative recommendation since I have only just found this program:

PDF24 at https://en.pdf24.org/

It looked a bit dodgy at first (just a general impression, but I’ve been doing this a long time and generally trust my cautionary instincts), but so far it has done NOTHING obnoxious (no pop-up ads, etc.). They are trying to sell their other services, but discreetly; it’s not in your face.

I ran the download through VirusTotal and it received 71 clean scans. (If you don’t know VirusTotal, you should: it tests files against almost all the well-known, and many obscure, anti-malware programs.)
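For anyone who would rather not upload a whole installer, VirusTotal can also be searched by file hash. Below is a minimal Python sketch that computes the SHA-256 of a downloaded file so the hex string can be pasted into the search box. The function name `sha256_of_file` is just an illustrative choice, not part of any tool mentioned in this thread.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes (end of file)
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Calling `sha256_of_file(Path("installer.exe"))` (a placeholder file name) gives a 64-character string. If the service has not seen that hash before, the search simply returns nothing and the file would need to be uploaded instead.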

So far I have converted 3 files. It isn’t particularly fast (on a large PDF) even on my very fast laptop, but it works well enough based on these first attempts.

They also have an online converter, which I did not try. My guess is that, given the slow speed, they email a link to the results when the file completes.

Preliminarily I think it is worth trying.

FemmeApprendLeFrancais · March 25, 2020, 3:09am

Thank you! I’ll have to try this website out.

herbm · March 25, 2020, 3:52am

To be clear: I am using the Windows app, as opposed to the website. (I didn’t try the website.)

Over two nights I’ve converted about 800,000 words in 4 PDFs, but I really haven’t been trying very hard to push books through the app. I just leave it running, and every so often I find one of my old (but good) “image PDFs” and put it in there.

It’s fast enough to do what I cared about but slow enough that I don’t want to sit and watch it. It will also process multiple books without making you wait to start the next one: just keep adding and starting them while it works on the others.

I haven’t looked really closely at the output, but a quick page-through after dumping it to text with PDFtoText indicates it is probably good enough.
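For anyone who wants to script the "dump it to text" step, here is a rough Python sketch that calls the `pdftotext` command-line tool on a PDF that already has a text layer (that is, after the OCR pass). It assumes `pdftotext` is on your PATH; the helper names `build_pdftotext_command` and `pdf_to_text` are my own, and the only option used is `-enc UTF-8` to get UTF-8 output.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List


def build_pdftotext_command(pdf_path: Path, txt_path: Path) -> List[str]:
    """Build the argument list for: pdftotext -enc UTF-8 input.pdf output.txt"""
    return ["pdftotext", "-enc", "UTF-8", str(pdf_path), str(txt_path)]


def pdf_to_text(pdf_path: Path) -> Path:
    """Write the text layer of pdf_path to a .txt file next to it."""
    # Assumption: pdftotext is installed and reachable through PATH.
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext was not found on PATH")
    txt_path = pdf_path.with_suffix(".txt")
    # check=True raises CalledProcessError if pdftotext exits with an error
    subprocess.run(build_pdftotext_command(pdf_path, txt_path), check=True)
    return txt_path
```

Note that `pdftotext` only extracts text that is already in the PDF. On an image-only scan it produces an empty or nearly empty file, which is why the OCR step has to come first.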

One test will be to upload one of them as a lesson in LingQ and see how many ‘trash words’ turn blue. (With some sources, I’ve had to temporarily turn off automatic “known words” to avoid grabbing mostly trash, but that was with LingQ doing the importing and conversion instead of my tools.)
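A rough way to estimate the ‘trash word’ rate before uploading is sketched below in Python. The rules are purely my own guesses at what OCR debris looks like (letters mixed with digits, or the same character three times in a row), so treat the number as a crude signal, not a measurement. The names `looks_like_trash` and `trash_ratio` are hypothetical helpers.

```python
import re

# Letters/digits (including accented letters), optionally joined by ' ’ or -
TOKEN_RE = re.compile(r"[^\W_]+(?:['’-][^\W_]+)*")


def looks_like_trash(token: str) -> bool:
    """Heuristic guess that a token is OCR debris rather than a real word."""
    has_letter = any(ch.isalpha() for ch in token)
    has_digit = any(ch.isdigit() for ch in token)
    if has_letter and has_digit:
        return True  # letters mixed with digits, e.g. "w0rd"
    if has_letter and re.search(r"(.)\1\1", token.lower()):
        return True  # same character three times in a row, e.g. "lllll"
    return False


def trash_ratio(text: str) -> float:
    """Fraction of letter-containing tokens flagged by looks_like_trash."""
    tokens = TOKEN_RE.findall(text)
    # Ignore pure numbers such as years or page numbers
    words = [t for t in tokens if any(ch.isalpha() for ch in t)]
    if not words:
        return 0.0
    return sum(looks_like_trash(t) for t in words) / len(words)
```

Running `trash_ratio` on the text of a converted file gives a value between 0 and 1. If it comes out high, it is probably worth turning off automatic “known words” before importing that file.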

PDFtoText is part of MiKTeX, and I see a copy in Msys64 and another in Git’s mingw64 directory (plus some more, since I install a ton of open source and “unix-like” tools on my Windows systems).

Looks like I have version 4.02 running. (I probably should clean up that mess of versions.)

musicserver77 · March 25, 2020, 10:35pm

More often than not I use the iPhone app and just take a picture (i.e. scan) within Google Translate. If you’re using a browser, then I assume PDF or text is best, so you’d have to save the photo as a PDF. The phone or iPad app is very convenient for this reason.