Free Tip: How to resize scanned PDFs with ghostscript for Adobe Acrobat OCR
I’m unaware of any free tool to perform OCR on a PDF and embed the resulting data in the PDF itself so it is text-searchable. If anyone knows of one, let me know. In the meantime, I use Acrobat Professional for this essential functionality.
High resolution PDFs produced by my scanner (HP Officejet Pro L7700) usually give the following error when I try to perform Acrobat OCR:
This page is larger than the maximum page size of 45 inches by 45 inches.
Surprisingly, there doesn’t seem to be any way to change the page size of a PDF within Acrobat. It’s possible to print to a new PDF of the correct size, but this operation cannot easily be batched. If I apply the “crop” tool to resize the page in Acrobat, I get this error:
Page size may not be reduced.
Many report these issues in Adobe’s forums. The most common responses suggest reconfiguring the scanner or buying a new one.
After some googling, I found no quick and easy ghostscript recipe for the batch pre-processing needed before Acrobat will do the OCR. It’s not hard to do, but finding the right switches takes some painful trial and error.
For posterity, then, here is a simple command line to make this happen (shown here for Windows, but easily adapted to any other platform). First, download the latest ghostscript for your platform (at this time, 8.64 for Windows). Then:
gswin32c -dQUIET -dNOPAUSE -dBATCH -sPAPERSIZE=letter -sDEVICE=pdfwrite -sOutputFile=OUTPUT.pdf -dPDFFitPage INPUT.pdf
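On Linux or macOS the ghostscript executable is normally called gs rather than gswin32c. Here is a minimal sketch of the same command wrapped in a shell function, assuming gs is on your PATH. The function name fit_to_letter and the GS_BIN override variable are hypothetical conveniences of mine, not part of ghostscript:

```shell
#!/bin/bash
# Sketch only: fit_to_letter and GS_BIN are hypothetical names, not ghostscript features.
# Usage: fit_to_letter INPUT.pdf OUTPUT.pdf
fit_to_letter() {
    # Default to "gs"; set GS_BIN=gswin32c to reuse this under Windows/cygwin.
    local gs_bin="${GS_BIN:-gs}"
    if [ "$#" -ne 2 ]; then
        echo "usage: fit_to_letter INPUT.pdf OUTPUT.pdf" >&2
        return 2
    fi
    # Same switches as the Windows command above: scale each page to fit letter paper.
    "$gs_bin" -dQUIET -dNOPAUSE -dBATCH -sPAPERSIZE=letter \
        -sDEVICE=pdfwrite -sOutputFile="$2" -dPDFFitPage "$1"
}
```

With that defined, fit_to_letter INPUT.pdf OUTPUT.pdf should behave like the Windows command.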
And here is a simple, inelegant script for batch processing (again under Windows/cygwin, but easily adaptable). Feel free to make it more elegant:
#!/bin/bash
# Shrink each PDF named on the command line to letter size, in place.
for x in "$@"
do
    echo -n "Processing $x ... "
    if [ ! -e "$x" ]
    then
        echo "File $x missing. Exiting."
        exit 1
    fi
    if [ -e gs_shrink_to_letter.pdf ]
    then
        echo "Tempfile gs_shrink_to_letter.pdf exists. Exiting."
        exit 1
    fi
    gswin32c -dQUIET -dNOPAUSE -dBATCH -sPAPERSIZE=letter -sDEVICE=pdfwrite -sOutputFile=gs_shrink_to_letter.pdf -dPDFFitPage "$x"
    # Save the ghostscript exit status now; a later echo would overwrite $?.
    status=$?
    if [ "$status" -eq 0 ]
    then
        echo "Success."
        mv gs_shrink_to_letter.pdf "$x"
    else
        echo "Error occurred, exiting."
        exit "$status"
    fi
done
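For whole folders of PDFs, a variant of the script above can walk a directory tree with find and use mktemp for the temporary file, so a leftover temp file does not block the run. This is only a sketch, assuming bash and a ghostscript binary named gs (set GS_BIN to gswin32c under cygwin). The names shrink_tree and GS_BIN are hypothetical ones introduced here:

```shell
#!/bin/bash
# Sketch only: shrink_tree and GS_BIN are hypothetical names, not ghostscript features.
# Usage: shrink_tree /path/to/folder
# Rewrites every *.pdf under the folder in place, so try it on a copy first.
shrink_tree() {
    local root="$1" gs_bin="${GS_BIN:-gs}" pdf tmp
    # -print0 with read -d '' keeps file names containing spaces intact.
    while IFS= read -r -d '' pdf; do
        tmp="$(mktemp "${TMPDIR:-/tmp}/gs_shrink.XXXXXX")" || return 1
        # Redirect stdin so ghostscript cannot swallow the file list.
        if "$gs_bin" -dQUIET -dNOPAUSE -dBATCH -sPAPERSIZE=letter \
               -sDEVICE=pdfwrite -sOutputFile="$tmp" -dPDFFitPage "$pdf" < /dev/null; then
            mv "$tmp" "$pdf"
            echo "Shrunk: $pdf"
        else
            # Leave the original untouched and stop at the first failure.
            rm -f "$tmp"
            echo "Failed: $pdf" >&2
            return 1
        fi
    done < <(find "$root" -type f -iname '*.pdf' -print0)
}
```

Running shrink_tree on the top-level scan folder should process every PDF beneath it and stop at the first file ghostscript rejects.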
After converting your PDFs as above, you can then apply Acrobat batch OCR without a hitch.
Anonymous Jan 18
Why not just use a better OCR solution?
adam Jan 18
Any suggestions? Acrobat Professional is generally pretty good for everything else; it’s surprising that it has this particular failing.
Gernot Hassenpflug Jan 19
Just to let you know GhostScript version 8.70 is the latest available for Windows. See here: http://sourceforge.net/projects/ghostscript/files/
Chung-chieh Shan Jan 19
ocrodjvu?
Albert Jan 20
Tesseract is awesome, but I find myself converting from PDF to ppm to tiff. Still I like Tesseract.
Adam Rosi-Kessel Jan 24
Ken: that would require PDF->DJVU->PDF, right? Do you think everything would survive verbatim?
Morten Oct 2
This tip was wonderfully helpful to me in a situation where I really, really needed it. Many thanks!
Ella Jun 11
OR, you can export the PDF file to a JPG file, resize it using Photoshop, and import it again to create a PDF file. No trial and error.
Brandon Jul 13
I could really use something like this. I have nearly 400k PDFs that I would like to batch OCR, but some are greater than the 45 inch maximum. I’m not familiar with ghostscript. How would I actually run this code on my folders? Thanks in advance!
RichS Aug 18
YAY!!! Thank you so much!!! This was so painful until I found your post. Here are simplified instructions for the less tech inclined:
(1) Select one of the download links for Ghostscript on this page: http://www.ghostscript.com/download/gsdnld.html
–>Download and install Ghostscript
(2) Copy your file to the directory Ghostscript was installed into.
(3) Rename your file to INPUT.PDF
(4) Press the START button and type “CMD”
(5) Right click the link and click “run as administrator”
(6) Copy this text:
gswin32c -dQUIET -dNOPAUSE -dBATCH -sPAPERSIZE=letter -sDEVICE=pdfwrite -sOutputFile=OUTPUT.pdf -dPDFFitPage INPUT.pdf
(7) Navigate to the folder where you installed Ghostscript, right click in the black, and select paste
(8) Press ENTER
Ta da!