Converting PDFs/Images into OCR Text to Import
I have a bunch of newspapers, magazines, and articles I would like to import into LingQ. But they're all in PDF format. Some of them also don't have optical character recognition (OCR).
- What are the best OCR software programs for PDF versions of the press?
- What software exists that will convert these OCR'd PDFs into plain text files that I can import into LingQ?
I gladly welcome all advice/suggestions!
Tuesday at 03:25
The following is a very tentative recommendation since I have only just found this program:
It looked a bit dodgy (just a general impression, but I've been doing this a long time and generally trust my cautionary instincts), but so far it has done NOTHING obnoxious (no pop-up ads etc.). They are trying to sell their other services, but discreetly; it's not in your face.
I ran the download through VirusTotal and it received 71 clean scans. (If you don't know VirusTotal, you should: it tests files against almost all the well-known, and many obscure, anti-malware programs.)
So far I have converted 3 files. It isn't particularly fast (on a large PDF) even on my very fast laptop, but it works well enough based on these first attempts.
Also, they have an online converter which I did not try. My guess is that they will email a link to the results when the file completes due to the slow speed.
Preliminarily, I think it is worth trying.
Tuesday at 16:55
Thank you, I’ll have to try this website out.
Wednesday at 03:09
To be clear: I am using the Windows app, as opposed to the website. (I didn't try the website.)
Over two nights I've converted about 800,000 words in 4 PDFs but I really haven't been trying very hard to push books through the app. I just leave it running and every so often find one of my old (but good) "image PDFs" and put it in there.
It's fast enough to do what I cared about but slow enough that I don't want to sit and watch it. It will also do multiple books without requiring you to wait to start the subsequent ones: just keep adding and starting them while it works on others.
I haven't looked really closely at the output, but a quick paging through after dumping it to text with PDFtoText indicates it is probably good enough.
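For anyone who would rather script that dump-to-text step, here is a minimal Python sketch. It assumes a `pdftotext` executable (the Xpdf or Poppler build) is on your PATH. The function names and the example file name are made up for illustration, and the `-enc UTF-8` and `-layout` options are just one reasonable choice, not necessarily what was used above.

```python
import shutil
import subprocess
from pathlib import Path


def build_pdftotext_command(pdf_path, txt_path, keep_layout=False):
    """Return the argument list for one pdftotext run (hypothetical helper)."""
    cmd = ["pdftotext", "-enc", "UTF-8"]
    if keep_layout:
        # -layout asks pdftotext to preserve the physical page layout (columns etc.).
        cmd.append("-layout")
    cmd += [str(pdf_path), str(txt_path)]
    return cmd


def pdf_to_text(pdf_path, keep_layout=False):
    """Convert an already-OCRed PDF to a .txt file next to it and return that path."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext was not found on PATH")
    txt_path = Path(pdf_path).with_suffix(".txt")
    # check=True raises CalledProcessError if pdftotext exits with an error.
    subprocess.run(build_pdftotext_command(pdf_path, txt_path, keep_layout), check=True)
    return txt_path
```

Calling `pdf_to_text("magazine_ocr.pdf")` should write `magazine_ocr.txt` in the same folder. Note that this only extracts text the PDF already contains; it does not do OCR itself.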
One test will be to upload one of them as a lesson in LingQ and see how many 'trash words' turn blue. With some sources, I've had to temporarily turn off automatic "known words" to avoid grabbing mostly trash, but that was with LingQ doing the importing and conversion instead of my tools.
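A rough way to estimate the 'trash word' rate before uploading anything is to count tokens that don't look like words at all. The sketch below is only a heuristic of my own, not anything LingQ does: it assumes a real word is letters only (optionally joined by an apostrophe or hyphen), so it also flags legitimate numbers such as years, and it cannot catch OCR errors that happen to look like words. The function name `trash_ratio` is hypothetical.

```python
import re

# Assumption: a plausible word is letters only, optionally joined by ' or -.
# [^\W\d_] matches any Unicode letter, so accented characters are accepted.
WORD_RE = re.compile(r"[^\W\d_]+(?:['’-][^\W\d_]+)*")


def trash_ratio(text):
    """Return the fraction of whitespace-separated tokens that look like OCR junk."""
    # Strip common punctuation from both ends of each token before judging it.
    tokens = [t.strip(".,;:!?\"“”()[]«»") for t in text.split()]
    tokens = [t for t in tokens if t]  # drop tokens that were pure punctuation
    if not tokens:
        return 0.0
    trash = [t for t in tokens if not WORD_RE.fullmatch(t)]
    return len(trash) / len(tokens)
```

For example, `trash_ratio("The qu1ck brown f0x jumps.")` flags `qu1ck` and `f0x`, i.e. 2 of 5 tokens. Comparing the ratio across a few converted files gives a quick sense of which ones need a closer look.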
PDFtoText is part of MiKTeX, and I see a copy in Msys64 and another in Git's mingw64 directory (plus some more, since I install a ton of open source and "unix-like" tools on my Windows systems).
Looks like I have version 4.02 running. (I probably should clean that mess of versions up.)
Wednesday at 03:52
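If you end up with several copies like this, a small script can list every one that sits in a PATH directory, in search order. This is a minimal sketch: the function name `find_all_on_path` is made up, and it only checks for a bare name or an `.exe` suffix, which is a simplification of how Windows actually resolves commands.

```python
import os
from pathlib import Path


def find_all_on_path(name, path_value=None, extensions=("", ".exe")):
    """List every file called `name` (plus an optional extension) in the PATH
    directories, in the order those directories are searched."""
    if path_value is None:
        path_value = os.environ.get("PATH", "")
    found = []
    for directory in path_value.split(os.pathsep):
        if not directory:
            continue  # skip empty entries such as a trailing separator
        for ext in extensions:
            candidate = Path(directory) / (name + ext)
            # Skip duplicates caused by a directory being listed twice in PATH.
            if candidate.is_file() and candidate not in found:
                found.append(candidate)
    return found
```

`find_all_on_path("pdftotext")` returns the copies it finds, and the first entry is normally the one that runs when you type the bare command. Running each copy with `-v` should then show which version it is.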