LazyWeb: Search for non-OCR’d PDFs?

Filed under Uncategorized by adam | September 25, 2010 | 57634 hits | 5 comments

Another LazyWeb request: any suggestions for how to search (on any platform) for PDFs that have not been OCR’d?

5 comments

Peter De Wachter Sep 25

If you find nothing else, something like this should work:
pdftotext "$pdf" - | grep -q '\w'
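
To turn that one-liner into an actual list, a rough sketch along these lines might do. It assumes pdftotext (from poppler-utils or xpdf) is installed. The function name list_untexted_pdfs is made up for illustration, and [[:alnum:]] stands in for \w so it doesn't depend on GNU grep. One caveat: a PDF that pdftotext can't read at all (encrypted, corrupt) would also land on the list.

```shell
#!/bin/sh
# Sketch: print every PDF under a directory that yields no extractable text.
# Assumption: pdftotext is on PATH. list_untexted_pdfs is a hypothetical name.
list_untexted_pdfs() {
    dir=${1:-.}
    find "$dir" -type f -iname '*.pdf' | while IFS= read -r pdf; do
        # "-" sends the extracted text to stdout; grep -q only sets the exit status.
        # An image-only PDF typically produces nothing but whitespace and form feeds.
        if ! pdftotext "$pdf" - 2>/dev/null | grep -q '[[:alnum:]]'; then
            printf '%s\n' "$pdf"
        fi
    done
}
```

After sourcing this, `list_untexted_pdfs ~/Documents` should print one path per line for each PDF with no text layer.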
Ma Sep 26

Take those .djvu-s: they’re entirely non-text. Or is that not what you asked for? :)
Rupert Sep 26

How on Earth could this work? I presume you mean that you want to find all PDFs containing the phrase “lettuce” in a directory or something. But if a PDF hasn’t been OCR’d, the computer just has a load of great big pictures and doesn’t know what text is in it. Making the PDFs searchable is OCR’ing them.

Are you looking for a tool that would somehow automatically run an OCR process on lots of documents? Do you realise that this is very very very CPU-intensive?
adam Sep 26

Sorry, perhaps my question wasn’t clear. I would like to be able to generate a list of PDFs that don’t have OCR text in them. There must be some relatively simple way to distinguish between bitmap and embedded text PDFs. Someone in another forum recommended checking the PDF for frames per page — if each page has more than two frames, it is likely a searchable PDF.
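
For what it's worth, here is a rough sketch of one way to make that bitmap vs. embedded-text distinction. It is not the frames-per-page idea; it swaps in a different signal, namely whether the PDF references any fonts at all. It assumes pdffonts (from poppler-utils or xpdf) is installed, and has_fonts and list_fontless_pdfs are made-up names. A scan with no text layer normally has no fonts, but treat this as a heuristic, not a guarantee.

```shell
#!/bin/sh
# Sketch: flag PDFs that reference no fonts, which usually means image-only scans.
# Assumption: pdffonts is on PATH. Both function names are hypothetical.
has_fonts() {
    # pdffonts prints two header lines, then one line per font,
    # so anything from the third line on means at least one font.
    pdffonts "$1" 2>/dev/null | tail -n +3 | grep -q .
}

list_fontless_pdfs() {
    for pdf in "$@"; do
        has_fonts "$pdf" || printf '%s\n' "$pdf"
    done
}
```

Something like `list_fontless_pdfs *.pdf` would then print the likely non-OCR'd files in the current directory.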
adam Sep 26

Here’s my first attempt at a hack that seems to work: http://adam.rosi-kessel.org/code/is_ocr.html