Filed under Uncategorized by adam | September 25, 2010 | 36206 hits | 5 comments
Another LazyWeb request: any suggestions for how to search (on any platform) for PDFs that have not been OCR’d?
If you find nothing else, something like this should work:
pdftotext "$pdf" - | grep -q '\w'
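The one-liner above tests a single file. A minimal sketch of how it might be extended to list every PDF under a directory that yields no extractable text follows. It assumes `pdftotext` from poppler-utils is installed and that `find` supports `-iname` (GNU and BSD both do); the function names `has_text` and `list_untexted` are made up for illustration and do not come from the original comment.

```shell
#!/bin/sh
# has_text FILE: succeed if pdftotext extracts at least one letter or digit.
# (Hypothetical helper name; assumes pdftotext from poppler-utils is on PATH.)
has_text() {
    pdftotext "$1" - 2>/dev/null | grep -q '[[:alnum:]]'
}

# list_untexted DIR: print every PDF under DIR that has no extractable text.
list_untexted() {
    find "$1" -type f -iname '*.pdf' | while IFS= read -r pdf; do
        has_text "$pdf" || printf '%s\n' "$pdf"
    done
}

# Run only when a directory is given on the command line.
if [ "$#" -gt 0 ]; then
    list_untexted "$1"
fi
```

Saved under a name of your choosing, it would be run with the directory to scan as its only argument. Encrypted or damaged PDFs will also show up in the list, because `pdftotext` produces no output for them either.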
Take those .djvu-s — they’re entirely non-text. Or is it not what you’ve asked for? :)
How on Earth could this work? I presume you mean that you want to find all PDFs in a directory that contain the word “lettuce”, or something like that. But if a PDF hasn’t been OCR’d, the computer just has a load of great big pictures and doesn’t know what text is in them. Rendering the PDFs searchable is OCR’ing them.
Are you looking for a tool that would somehow automatically run an OCR process on lots of documents? Do you realise that this is very very very CPU-intensive?
Sorry, perhaps my question wasn’t clear. I would like to be able to generate a list of PDFs that don’t have OCR text in them. There must be some relatively simple way to distinguish between bitmap and embedded text PDFs. Someone in another forum recommended checking the PDF for frames per page — if each page has more than two frames, it is likely a searchable PDF.
Here’s my first attempt at a hack that seems to work: http://adam.rosi-kessel.org/code/is_ocr.html
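For comparison, here is a different heuristic, offered as an assumption and not as a description of what the linked script does. A PDF with a text layer normally references at least one font, so if `pdffonts` (also part of poppler-utils) lists none, the file is probably a bare scan. The helper name `has_fonts` is hypothetical, and the check relies on `pdffonts` printing two header lines before the font entries.

```shell
#!/bin/sh
# has_fonts FILE: succeed if pdffonts lists at least one font for FILE.
# (Hypothetical helper name; assumes pdffonts from poppler-utils is on PATH.)
has_fonts() {
    # Skip the two header lines; any further line is a font entry.
    pdffonts "$1" 2>/dev/null | awk 'NR > 2 { found = 1 } END { exit !found }'
}

# Print each PDF named on the command line that references no fonts.
for pdf in "$@"; do
    has_fonts "$pdf" || printf '%s\n' "$pdf"
done
```

This is only a rough signal: a PDF made purely of vector drawings would also be reported, so it is worth spot-checking a few results by hand.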