Free Tip: How to resize scanned PDFs with Ghostscript for Adobe Acrobat OCR

I’m unaware of any free tool to perform OCR on a PDF and embed the resulting data in the PDF itself so it is text-searchable. If anyone knows of one, let me know. In the meantime, I use Acrobat Professional for this essential functionality.

High resolution PDFs produced by my scanner (HP Officejet Pro L7700) usually give the following error when I try to perform Acrobat OCR:

This page is larger than the maximum page size of 45 inches by 45 inches.

Surprisingly, there doesn’t seem to be any way to change a PDF’s page size within Acrobat. It’s possible to print to a new PDF of the correct size, but this operation cannot easily be batched. If I use the “crop” tool to resize the page in Acrobat, I get this error:

Page size may not be reduced.

Many report these issues in Adobe’s forums. The most common responses suggest reconfiguring the scanner or buying a new one.

After some googling, I found no quick and easy Ghostscript recipe for the batch pre-processing needed before Acrobat can do the OCR. It’s not hard to do; finding the right switches just takes some painful trial and error.

For posterity, then, here is a simple command line to make this happen (shown here for Windows, but easily adapted to any other platform). First, download the latest Ghostscript for your platform (at the time of writing, 8.64 for Windows). Then:

gswin32c -dQUIET -dNOPAUSE -dBATCH -sPAPERSIZE=letter -sDEVICE=pdfwrite -sOutputFile=OUTPUT.pdf -dPDFFitPage INPUT.pdf
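
On Linux or macOS the executable is normally called gs rather than gswin32c. As a minimal sketch, the same switches can be wrapped in a small shell function. The function name fit_to_letter and the GS variable are placeholders of my own for illustration, not anything Ghostscript defines:

```shell
#!/bin/bash
# Sketch only: fit_to_letter and GS are illustrative names, not part of Ghostscript.
# GS defaults to "gs"; set GS=gswin32c (or gswin64c) on Windows.
fit_to_letter() {
    input="$1"
    output="$2"
    # Writing the output over the file being read would destroy the input.
    if [ "$input" = "$output" ]; then
        echo "Input and output must be different files." >&2
        return 2
    fi
    # -sPAPERSIZE=letter sets the target page size, -dPDFFitPage scales each
    # page of the input PDF to fit it, and the pdfwrite device emits a new PDF.
    "${GS:-gs}" -dQUIET -dNOPAUSE -dBATCH \
        -sPAPERSIZE=letter -sDEVICE=pdfwrite \
        -sOutputFile="$output" -dPDFFitPage "$input"
}
```

Usage would look like `fit_to_letter scan.pdf scan-letter.pdf`, or `GS=gswin32c fit_to_letter scan.pdf scan-letter.pdf` under Cygwin.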

And a simple, inelegant script for batch processing (again under Windows/Cygwin, but easily adaptable). Feel free to make it more elegant:

#!/bin/bash
# Shrink each PDF named on the command line to letter size, in place.
tmp=gs_shrink_to_letter.pdf
for x in "$@"
do
    echo -n "Processing $x ... "
    if [ ! -e "$x" ]
    then
        echo "File $x missing. Exiting."
        exit 1
    fi
    # Refuse to clobber a temp file left over from an earlier run.
    if [ -e "$tmp" ]
    then
        echo "Tempfile $tmp exists. Exiting."
        exit 1
    fi
    gswin32c -dQUIET -dNOPAUSE -dBATCH -sPAPERSIZE=letter -sDEVICE=pdfwrite -sOutputFile="$tmp" -dPDFFitPage "$x"
    # Save Ghostscript's exit status before echo overwrites $?.
    status=$?
    if [ "$status" -eq 0 ]
    then
        echo "Success."
        mv "$tmp" "$x"
    else
        echo "Error occurred, exiting."
        exit "$status"
    fi
done
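
If the PDFs are spread across nested folders rather than named on the command line, something like the following might work. It is a sketch under assumptions: shrink_tree, GS, and the ".shrink.tmp" suffix are names I made up for illustration, the match on '*.pdf' is case-sensitive, and an existing file with the temp name would be overwritten.

```shell
#!/bin/bash
# Sketch only: shrink_tree and GS are illustrative names, not part of Ghostscript.
# Shrinks every *.pdf under the given directory, replacing each file in place.
shrink_tree() {
    # find hands batches of file names to a small inline script;
    # the Ghostscript executable is passed in as its first argument.
    find "$1" -type f -name '*.pdf' -exec sh -c '
        gs_bin="$1"; shift
        status=0
        for pdf in "$@"; do
            tmp="$pdf.shrink.tmp"   # hypothetical temp name next to the original
            if "$gs_bin" -dQUIET -dNOPAUSE -dBATCH -sPAPERSIZE=letter \
                    -sDEVICE=pdfwrite -sOutputFile="$tmp" -dPDFFitPage "$pdf"
            then
                mv "$tmp" "$pdf"    # replace the original only on success
            else
                echo "Failed, left unchanged: $pdf" >&2
                rm -f "$tmp"
                status=1
            fi
        done
        exit "$status"
    ' sh "${GS:-gs}" {} +
}
```

Usage would look like `shrink_tree /path/to/scans`. Unlike the script above, this keeps going after a failure and leaves the failed file untouched, which seemed the safer choice for a large batch. Try it on a copy of the folder first.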

 
After converting your PDFs as above, you can then apply Acrobat batch OCR without a hitch.
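
If only some of your scans trip the 45-inch limit, it may be worth checking page sizes first and shrinking only the offenders. The sketch below assumes the pdfinfo tool (from poppler-utils or Xpdf) is installed and that it reports a line such as "Page size: 612 x 792 pts". The name is_oversize is mine, and treating either dimension above 3240 pt (45 inches at 72 points per inch) as too large is my reading of the Acrobat error, not a documented rule. By default pdfinfo reports only the first page.

```shell
#!/bin/bash
# Sketch only: is_oversize is an illustrative name.
# Reads pdfinfo output on stdin; succeeds if the reported page is wider or
# taller than 45 inches (45 in x 72 pt/in = 3240 pt).
is_oversize() {
    awk -v max=3240 '
        /^Page size:/ { if ($3 + 0 > max + 0 || $5 + 0 > max + 0) found = 1 }
        END { exit !found }
    '
}
```

Usage would look like `pdfinfo scan.pdf | is_oversize && echo "scan.pdf needs shrinking"`.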

19 comments

  1. Anonymous Jan 18

    Why not just use a better OCR solution?

  2. adam Jan 18

    Any suggestions? Acrobat Professional is generally pretty good for everything else; it’s surprising that it has this particular failing.

  3. Gernot Hassenpflug Jan 19

    Just to let you know GhostScript version 8.70 is the latest available for Windows. See here: http://sourceforge.net/projects/ghostscript/files/

  4. Chung-chieh Shan Jan 19

    ocrodjvu?

  5. Albert Jan 20

    Tesseract is awesome, but I find myself converting from PDF to ppm to tiff. Still I like Tesseract.

  6. Adam Rosi-Kessel Jan 24

    Ken: that would require PDF->DJVU->PDF, right? Do you think everything would survive verbatim?

  7. Morten Oct 2

    This tip was wonderfully helpful to me in a situation where I really, really needed it. Many thanks!

  8. Ella Jun 11

    OR, you can export the PDF file to a JPG file, resize it using Photoshop, and import it again to create a PDF file. No trial and error.

  9. Brandon Jul 13

    I could really use something like this – have nearly 400k pdf’s that I would like to batch OCR but some are greater than the 45 inch maximum. I’m not familiar with ghostscript – how would I actually run this code on my folders? Thanks in advance!

  10. RichS Aug 18

    YAY!!! Thank you so much!!! This was so painful until I found your post. Here are simplified instructions for the less tech inclined:

    (1) Select one of the download links for Ghostscript on this page: http://www.ghostscript.com/download/gsdnld.html
    Download and install Ghostscript.
    (2) Copy your file to the directory Ghostscript was installed into.
    (3) Rename your file to INPUT.PDF
    (4) Press the START button and type “CMD”
    (5) Right-click the link and click “Run as administrator”
    (6) Copy this text:

    gswin32c -dQUIET -dNOPAUSE -dBATCH -sPAPERSIZE=letter -sDEVICE=pdfwrite -sOutputFile=OUTPUT.pdf -dPDFFitPage INPUT.pdf

    (7) Navigate to the folder where you installed Ghostscript, right-click in the black window, and select Paste
    (8) Press ENTER

    Ta da!

  11. lexob9 Aug 31

    RichS: thanks man. The initial instructions were very hard for me to apply, but your simple instructions were easy to follow. It resized a 400-page PDF document in less than a minute. Good job!

  12. Andy Sep 17

    This formula worked for me, but ONLY after I added the additional switch -dFitPage.

    I was using the Windows 64 bit version 9.10.

    So thanks for the help on getting my 45″ x 43″ page reduced for OCRing.

  13. Italo Oct 22

    RichS: Man, without you this post would be meaningless for me. It worked for me, now I can get my OCR to work. Thanks, thanks a lot!!!

  14. idsnowdog Jan 2

    I have been using this script to resize pdfs, but it only works on one file at a time. I do not know scripting. Can this script be modified to convert *.pdf and have them each create an output of the same name with *-copy.pdf appended?

    #!/bin/sh

    gs -q -dNOPAUSE -dBATCH -dSAFER \
    -sDEVICE=pdfwrite \
    -dCompatibilityLevel=1.3 \
    -dPDFSETTINGS=/screen \
    -dEmbedAllFonts=true \
    -dSubsetFonts=true \
    -dColorImageDownsampleType=/Bicubic \
    -dColorImageResolution=144 \
    -dGrayImageDownsampleType=/Bicubic \
    -dGrayImageResolution=144 \
    -dMonoImageDownsampleType=/Bicubic \
    -dMonoImageResolution=144 \
    -sOutputFile=out.pdf \
    "$1"

  15. Adam Rosi-Kessel Jan 3

    Just add a for loop:

    for x in *.pdf
    do
    gs [options] -sOutputFile="${x%%.pdf}-copy.pdf" "$x"
    done

  16. Lance Jan 28

    RichS and all,

    Thanks for the helpful thread. I’m still a bit of a newbie when it comes to GS and am having some trouble with syntax. Hope you can help me out!

    I have 64-bit gs9.10 installed on a Windows machine and have the file I want to resize in the same folder as my gs install.

    Here’s what I am running from the cmd line:
    >gswin64c -dQUIET -dNOPAUSE -dBATCH -sPAPERSIZE=letter -sDEVICE=pdfwrite -sOutputFile=OUTPUT.pdf -dFitPage INPUT.pdf

    This generates the following message:
    GPL Ghostscript 9.10: **** Could not open the file OUTPUT.pdf . **** Unable to open the initial device, quitting.

    Thanks!

  17. adam Jan 28

    Perhaps you are running gs in a directory where you do not have write permissions? Try putting in an explicit path.

  18. Lance Jan 30

    adam, Thanks! That did the trick. I have admin rights, but every once in a while I have to enter my credentials before running a function. Using an explicit directory that was not restricted solved it. Thanks!
    Lance

  19. William Jun 19

    Thanks a lot ! Thanks a lot ! It helped me to run OCR on a 60 MB 170 page book in a breeze. The initial tutorial was a bit tough but the simplified version from RichS did the trick for me.
