Free Tip: How to resize scanned PDFs with ghostscript for Adobe Acrobat OCR

Filed under Code, Free Software, Linux by adam | January 18, 2010 | 50943 hits | 25 comments

I’m unaware of any free tool to perform OCR on a PDF and embed the resulting data in the PDF itself so it is text-searchable. If anyone knows of one, let me know. In the meantime, I use Acrobat Professional for this essential functionality.

High resolution PDFs produced by my scanner (HP Officejet Pro L7700) usually give the following error when I try to perform Acrobat OCR:

This page is larger than the maximum page size of 45 inches by 45 inches.

Surprisingly, there doesn’t seem to be any way to change the page size of a PDF within Acrobat. It’s possible to print to a new PDF of the correct size, but this operation cannot easily be batched. If I apply the “crop” tool to resize the page in Acrobat, I get this error:

Page size may not be reduced.

Many report these issues in Adobe’s forums. The most common responses suggest reconfiguring the scanner or buying a new one.

After some googling, I found no quick and easy ghostscript recipe for the batch pre-processing needed to let Acrobat do the OCR. It’s not hard to do, just a bit of a trial-and-error pain to get the right switches.

For posterity, then, here is a simple command line to make this happen (shown here under Windows, but easily adapted to any other platform). First, download the latest ghostscript for your platform (at this time, 8.64 for Windows). Then:

gswin32c -dQUIET -dNOPAUSE -dBATCH -sPAPERSIZE=letter -sDEVICE=pdfwrite -sOutputFile=OUTPUT.pdf -dPDFFitPage INPUT.pdf
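The executable name varies by platform and build: commenters on this post report gswin64c for the 64-bit Windows version and a different command name on Ubuntu. As a rough sketch, a small wrapper could pick whichever name is on the PATH. The function names find_gs and shrink_to_letter are my own inventions for illustration, not part of ghostscript, and the candidate list is an assumption about common installs:

```shell
#!/bin/bash
# Sketch only: find_gs and shrink_to_letter are illustrative names.

# Print the first command from the argument list that exists on the PATH.
find_gs() {
    local candidate
    for candidate in "$@"
    do
        if command -v "$candidate" >/dev/null 2>&1
        then
            echo "$candidate"
            return 0
        fi
    done
    return 1
}

# Resize INPUT ($1) to letter size, writing OUTPUT ($2).
shrink_to_letter() {
    local gs_bin
    gs_bin=$(find_gs gswin64c gswin32c gs) || {
        echo "No ghostscript executable found on PATH." >&2
        return 1
    }
    "$gs_bin" -dQUIET -dNOPAUSE -dBATCH -sPAPERSIZE=letter -sDEVICE=pdfwrite \
        -sOutputFile="$2" -dPDFFitPage "$1"
}
```

With those functions loaded, `shrink_to_letter INPUT.pdf OUTPUT.pdf` runs the same command as above with whichever executable was found.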

And a simple, inelegant script for batch processing (again, under Windows/cygwin, but easily adaptable). Feel free to make it more elegant:

#!/bin/bash
# Shrink each PDF named on the command line to letter size, in place.
tmp=gs_shrink_to_letter.pdf
for x in "$@"
do
    echo -n "Processing $x ... "
    if [ ! -e "$x" ]
    then
        echo "File $x missing. Exiting."
        exit 1
    fi
    # Refuse to clobber a temp file left over from an earlier run.
    if [ -e "$tmp" ]
    then
        echo "Tempfile $tmp exists. Exiting."
        exit 1
    fi
    if gswin32c -dQUIET -dNOPAUSE -dBATCH -sPAPERSIZE=letter -sDEVICE=pdfwrite -sOutputFile="$tmp" -dPDFFitPage "$x"
    then
        echo "Success."
        mv "$tmp" "$x"
    else
        status=$?   # save the gs exit status before echo overwrites it
        echo "Error occurred, exiting."
        exit "$status"
    fi
done

After converting your PDFs as above, you can then apply Acrobat batch OCR without a hitch.
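If only some of your scans are oversized, it may be worth checking the page size first and leaving the rest alone. PDF page dimensions are measured in points of 1/72 inch, so 45 inches is 45 × 72 = 3240 points. The sketch below assumes the separate pdfinfo tool (from poppler-utils or xpdf) is installed and that it prints a "Page size:" line for the first page; the helper names are hypothetical, and I am treating the 45-inch figure from the error message as a strict limit:

```shell
#!/bin/bash
# Sketch only: assumes pdfinfo (poppler-utils or xpdf) is installed.
# The helper names are illustrative, not from any tool.

# Succeed if either dimension, given in PDF points (1/72 inch),
# exceeds 45 inches (45 * 72 = 3240 points).
exceeds_acrobat_limit() {
    awk -v w="$1" -v h="$2" 'BEGIN { exit !(w > 3240 || h > 3240) }'
}

# Pull "WIDTH HEIGHT" out of a pdfinfo "Page size:" line read from stdin.
parse_page_size() {
    awk '/^Page size:/ { print $3, $5; exit }'
}

# Succeed if the first page of the PDF ($1) is larger than 45 inches.
# Only the first page is inspected.
needs_resize() {
    local size
    size=$(pdfinfo "$1" | parse_page_size) || return 2
    [ -n "$size" ] || return 2
    # Intentionally unquoted so width and height become two arguments.
    exceeds_acrobat_limit $size
}
```

Inside the batch loop, a line such as `needs_resize "$x" || continue` would then skip files that are already small enough.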

25 comments

Anonymous Jan 18

Why not just use a better OCR solution?
adam Jan 18

Any suggestions? Acrobat Professional is generally pretty good for everything else; it’s surprising that it has this particular failing.
Gernot Hassenpflug Jan 19

Just to let you know GhostScript version 8.70 is the latest available for Windows. See here: http://sourceforge.net/projects/ghostscript/files/
Chung-chieh Shan Jan 19

ocrodjvu?
Albert Jan 20

Tesseract is awesome, but I find myself converting from PDF to ppm to tiff. Still I like Tesseract.
Adam Rosi-Kessel Jan 24

Ken: that would require PDF->DJVU->PDF, right? Do you think everything would survive verbatim?
Morten Oct 2

This tip was wonderfully helpful to me in a situation where I really, really needed it. Many thanks!
Ella Jun 11

Or, you can export the PDF file to a JPG file, resize it using Photoshop, and import it again to create a PDF file. No trial and error.
Brandon Jul 13

I could really use something like this – have nearly 400k pdf’s that I would like to batch OCR but some are greater than the 45 inch maximum. I’m not familiar with ghostscript – how would I actually run this code on my folders? Thanks in advance!
RichS Aug 18

YAY!!! Thank you so much!!! This was so painful until I found your post. Here are simplified instructions for the less tech inclined:

(1) Select one of the download links for Ghostscript on this page: http://www.ghostscript.com/download/gsdnld.html
-->Download and install Ghostscript
(2) Copy your file to the directory Ghostscript was installed into.
(3) Rename your file to INPUT.PDF
(4) Press the START button and type “CMD”
(5) Right click the link and click “run as administrator”
(6) Copy this text:

gswin32c -dQUIET -dNOPAUSE -dBATCH -sPAPERSIZE=letter -sDEVICE=pdfwrite -sOutputFile=OUTPUT.pdf -dPDFFitPage INPUT.pdf

(7) Navigate to the folder where you installed Ghostscript, right click in the black, and select paste
(8) Press ENTER

Ta da!
lexob9 Aug 31

RichS: thanks man. The initial instructions were very hard for me to apply, but your simple instructions were easy to follow. It took less than a minute to resize a 400-page PDF document. Good job!
Andy Sep 17

This formula worked for me, but ONLY after I added the additional switch -dFitPage.

I was using the Windows 64 bit version 9.10.

So thanks for the help on getting my 45″ x 43″ page reduced for OCRing.
Italo Oct 22

RichS: Man, without you this post would be meaningless for me. It worked for me, now I can make my OCR to work. Thanks thanks a lot!!!
idsnowdog Jan 2

I have been using this script to resize pdfs, but it only works on one file at a time. I do not know scripting. Can this script be modified to convert *.pdf and have them each create an output of the same name with *-copy.pdf appended?

#!/bin/sh

gs -q -dNOPAUSE -dBATCH -dSAFER \
-sDEVICE=pdfwrite \
-dCompatibilityLevel=1.3 \
-dPDFSETTINGS=/screen \
-dEmbedAllFonts=true \
-dSubsetFonts=true \
-dColorImageDownsampleType=/Bicubic \
-dColorImageResolution=144 \
-dGrayImageDownsampleType=/Bicubic \
-dGrayImageResolution=144 \
-dMonoImageDownsampleType=/Bicubic \
-dMonoImageResolution=144 \
-sOutputFile=out.pdf \
"$1"
Adam Rosi-Kessel Jan 3

Just add a for loop:

for x in *.pdf
do
gs [options] -sOutputFile="${x%%.pdf}-copy.pdf" "$x"
done
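Spelled out a little more, as a sketch only: copy_name and resize_all are made-up names, and the gs options shown are the ones from the post, so substitute your own. The skip for files already ending in -copy.pdf is an extra assumption, added so that a second run does not convert its own output:

```shell
#!/bin/bash
# Sketch only: copy_name and resize_all are illustrative names, and the
# gs options are the ones from the post; substitute your own as needed.

# Map "scan.pdf" to "scan-copy.pdf".
copy_name() {
    echo "${1%.pdf}-copy.pdf"
}

# Convert every PDF in the current directory, leaving originals untouched.
resize_all() {
    local x
    for x in *.pdf
    do
        [ -e "$x" ] || continue          # no PDFs: the glob stays literal
        case "$x" in
            *-copy.pdf) continue ;;      # skip output from an earlier run
        esac
        gs -q -dNOPAUSE -dBATCH -sPAPERSIZE=letter -sDEVICE=pdfwrite \
            -sOutputFile="$(copy_name "$x")" -dPDFFitPage "$x" || return
    done
}
```

Run `resize_all` from the directory that holds the PDFs.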
Lance Jan 28

RichS and all,

Thanks for the helpful thread. I’m still a bit of a newbie when it comes to GS and am having some trouble with syntax. Hope you can help me out!

I have 64-bit gs9.10 installed on a Windows machine and have the file I want to resize in the same folder as my gs install.

Here’s what I am running from the cmd line:
>gswin64c -dQUIET -dNOPAUSE -dBATCH -sPAPERSIZE=letter -sDEVICE=pdfwrite -sOutputFile=OUTPUT.pdf -dFitPage INPUT.pdf

This generates the following message:
GPL Ghostscript 9.10: **** Could not open the file OUTPUT.pdf . **** Unable to open the initial device, quitting.

Thanks!
adam Jan 28

Perhaps you are running gs in a directory where you do not have write permissions? Try putting in an explicit path.
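From a cygwin or Linux shell, one way to test this before running gs is to check whether the current directory is writable and fall back to another one if not. This is only a sketch: writable_output_path is a hypothetical helper, the default fallback of $HOME is an assumption, and the shell's idea of "writable" may not match every Windows permission prompt:

```shell
#!/bin/bash
# Sketch only: writable_output_path is an illustrative name.

# Print a full path for the output file ($1): in the current directory if
# it is writable, otherwise in the fallback directory ($2, default $HOME).
writable_output_path() {
    local name fallback
    name=$1
    fallback=${2:-$HOME}
    if [ -w . ]
    then
        echo "$PWD/$name"
    elif [ -d "$fallback" ] && [ -w "$fallback" ]
    then
        echo "$fallback/$name"
    else
        echo "No writable directory for $name." >&2
        return 1
    fi
}
```

The result can then be passed to gs as `-sOutputFile="$(writable_output_path OUTPUT.pdf)"`.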
Lance Jan 30

adam, Thanks! That did the trick. I have admin rights, but every once in a while I have to enter my credentials before running a function. Using an explicit directory that was not restricted did the trick. Thanks!
Lance
William Jun 19

Thanks a lot ! Thanks a lot ! It helped me to run OCR on a 60 MB 170 page book in a breeze. The initial tutorial was a bit tough but the simplified version from RichS did the trick for me.
Alex Feb 12

Thanks, worked seamlessly for me, with a few remarks: minor adjustments to the command line are needed, depending on the gs version you are using. Mine was gs9.15 for Windows 8.1 x64.
Make sure your INPUT.pdf file is located in the same folder as the gs application.
Thanks again to all contributors!
Michael Aug 22

This advice was excellent. I tried it on Ubuntu and it worked. Just type a command line with ‘ghostscript’ instead of ‘gswin32c’
Melissa Dec 16

I read all of your comments and it seemed super complicated, so I thought I’d try opening the PDF and selecting print. Instead of printing to a printer, I chose the Adobe PDF printer, hit print, and selected where I wanted to save the file, and it worked like a charm! It had then resized the file to the correct letter size and allowed me to perform the OCR. So so much simpler! Hope this helps someone!
Atcold Aug 22

Wow, it did work!
Not sure yet about the ‘how’, though.
I imagine `-dPDFFitPage` fits each page to `-sPAPERSIZE=letter`, while preserving all bookmarks and other stuff!
This was just awesome!
Cheers!
Atcold Aug 23

Question. The original file was 13.5MB and the output file is now 47.8MB.
Here’s the original file http://bit.ly/2wwWmGw
Is there a way not to increase the weight of the file?

Thank you!
Pat May 23

On 5/23/2021 I followed Melissa’s suggestion and printed the pdf (100 pages) to Adobe PDF printer option. Afterwards, I tried again to perform the OCR and it worked fine. However, it does create a very large file size, so I will probably try to reduce the file size.

The Substantially Similar Weblog
