I started this blog over a decade ago. Over time, my priorities have changed—family, work, home, etc. Many other avenues for online self-expression have also developed in the interim. I’m done for now. Old entries remain online, but do not necessarily reflect any of my current views, and certainly do not reflect any views of any of my employers.

LazyWeb: Search for non-OCR’d PDFs?

Another LazyWeb request: any suggestions for how to search (on any platform) for PDFs that have not been OCR’d?
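One possible approach, sketched here as an assumption rather than a tested answer: run `pdftotext` (from poppler-utils, not mentioned in the original request) over each file and flag the PDFs whose extracted text contains no letters or digits, since a scanned PDF without an OCR layer typically yields no text. The function name `list_pdfs_without_text` is hypothetical.

```shell
# Hypothetical helper: list PDFs under a directory that yield no extractable text.
list_pdfs_without_text() {
  local dir="${1:-.}" f
  find "$dir" -type f -iname '*.pdf' -print0 |
  while IFS= read -r -d '' f
  do
    # pdftotext writes to stdout when the output file is "-"
    if ! pdftotext "$f" - 2>/dev/null | grep -q '[[:alnum:]]'
    then
      printf '%s\n' "$f"
    fi
  done
}
```

Calling `list_pdfs_without_text ~/Documents` would print one path per line. A scan that carries only a stray text header would slip through this check, so treat the result as a first pass.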

Lazyweb Request: Profiling Timer Expired?

Dear Lazyweb:

I have a bash script with a while loop that takes a long time to process. It restores file modification times for complicated reasons not worth discussing here. Removing some nonessential stuff, I have the following code (I know it could be rewritten to be elegant, or at least collapsed into a single line):

cat ./preserve_file_mod_times | while read x
do
  filename=`echo "$x"|sed 's/|.*//g'`
  lastmod=`echo "$x"|sed 's/^.*|//g'`
  touch -t "$lastmod" "$filename"
done

As the input file has gotten longer, the loop now frequently fails:

376 Profiling timer expired touch -t "$lastmod" "$filename"

I’ve googled this error and understand what it is (i.e. SIGPROF) but not how to fix it or work around it. Any hints?
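A minimal sketch of the same loop using bash parameter expansion instead of two `sed` calls per line, so the only external process spawned per entry is `touch`. This is an illustration, not a confirmed fix for the SIGPROF error. The function name `restore_mod_times` is hypothetical, and the input is assumed to be `filename|timestamp` lines as in the loop above.

```shell
#!/bin/bash
# Hypothetical helper: restore modification times from a "filename|timestamp" list.
restore_mod_times() {
  local list="$1" line filename lastmod
  while IFS= read -r line
  do
    filename="${line%%|*}"   # everything before the first "|"
    lastmod="${line##*|}"    # everything after the last "|"
    touch -t "$lastmod" "$filename"
  done < "$list"
}
```

It would be called as `restore_mod_times ./preserve_file_mod_times`.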


Lake Champlain Sunset

Lazyweb Search Request: Easy Content Management

Dear Lazyweb:

Can you suggest open source content management software that meets these criteria:

  • Cross-platform
  • Very easy to install and configure (from the admin side) and use (from the user side) — I’m thinking as easy from both sides as vqwiki
  • Drag-and-drop to upload content — ideally, the user could drag a DOC file from desktop into a widget in the browser to upload
  • Quick searching and indexing, at least of common file-types (including DOC and PDF)
  • Ability to set up arbitrary metadata elements and values that can assist as filters to searching

Generally, content will be located by search, rather than via any particular folder hierarchy.

Does it exist? There seem to be a lot of open source CMS options, but at least at first glance they may be overkill with a significant learning curve at least on the admin side.

Green Lemonade Success

My most successful Green Lemonade to date:

Green Lemonade

Approximate recipe:

  • Five or six leaves of kale
  • A bunch of fresh parsley
  • One-third of a medium-sized cucumber
  • A thick inch of fresh ginger (peeled)
  • A whole lemon (include seeds, peel, etc.)
  • A whole lime (likewise)
  • Two fuji apples (cored)

Juice and consume.

I have heard from green-juice skeptics before. It may be hopeless for some poor souls. But don’t knock it until you’ve tried it.

Windows curly quotes, accented characters on Linux Samba Shares and Cygwin XTerm: How to get Windows-1252 (AKA CP1252) from Linux

Before I forget: I have a bunch of files I mirror between Windows/NTFS and Linux/ext4 filesystems that include not only accented characters but also curly quotes in the filenames. (I know: the easiest solution would be to just get rid of the extended characters.) The curly quotes were created in Windows, so they don’t render properly in standard Linux character sets (UTF-8, iso8859-1, iso8859-15, etc.).

This all came up because iTunes under Windows couldn’t find curly-quote files when it was reading from the exported Samba share filesystem rather than an attached NTFS drive. The files showed up as missing because they had different filenames.
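To illustrate the mismatch, here is a small sketch (assuming `iconv` and `od` are available, as they are on most Linux systems) showing that the single CP1252 byte for a left curly double quote becomes three bytes in UTF-8, which is why the same filename can compare as different across the two encodings. The function name `cp1252_to_utf8_hex` is hypothetical.

```shell
# Hypothetical helper: print the UTF-8 bytes (hex) for text given as CP1252 on stdin.
cp1252_to_utf8_hex() {
  iconv -f CP1252 -t UTF-8 | od -An -tx1 | tr -s ' \n' ' ' | sed 's/^ //; s/ $//'
}

# 0x93 (octal 223) is the left curly double quote in CP1252.
printf '\223' | cp1252_to_utf8_hex   # prints: e2 80 9c
```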

The solution was not easily google-able, so for the record, in brief, add this to the [Global] section of /etc/samba/smb.conf:

unix charset = cp1252
display charset = cp1252

And reload Samba.

Also, to make the characters render properly from a terminal on the Linux box, first create the relevant character set:

sudo localedef -f CP1252 -i en_US en_US.CP1252

Now you can use this charset on your Linux box, and, like magic, the curly characters will be back:

export LC_ALL='en_US.cp1252'

Free Tip: How to resize scanned PDFs with ghostscript for Adobe Acrobat OCR

I’m unaware of any free tool to perform OCR on a PDF and embed the resulting data in the PDF itself so it is text-searchable. If anyone knows of one, let me know. In the meantime, I use Acrobat Professional for this essential functionality.

High resolution PDFs produced by my scanner (HP Officejet Pro L7700) usually give the following error when I try to perform Acrobat OCR:

This page is larger than the maximum page size of 45 inches by 45 inches.

Surprisingly, there doesn’t seem to be any way to change the page size of a PDF within Acrobat. It’s possible to print to a new PDF of the correct size, but this operation cannot easily be batched. If I apply the “crop” tool to resize the page in Acrobat, I get this error:

Page size may not be reduced.

Many report these issues in Adobe’s forums. The most common responses suggest reconfiguring the scanner or buying a new one.

After some googling, I found no simple ghostscript recipe for the batch pre-processing that Acrobat needs before it can do the OCR. It’s not hard to do; getting the right switches just takes some painful trial and error.

For posterity, then, here is a simple command-line to make this happen (here under Windows, but could obviously easily be adapted for any other platform). First, download the latest ghostscript for your platform (at this time, 8.64 for Windows). Then:

gswin32c -dQUIET -dNOPAUSE -dBATCH -sPAPERSIZE=letter -sDEVICE=pdfwrite -sOutputFile=OUTPUT.pdf -dPDFFitPage INPUT.pdf

And a simple inelegant script to batch process (again, under Windows/cygwin, but easily adaptable). Feel free to make more elegant:

for x in "$@"
do
  echo -n "Processing $x ... "
  if [ ! -e "$x" ]
  then
    echo "File $x missing. Exiting."
    exit 1
  fi
  # Refuse to clobber a leftover temp file from an earlier run
  if [ -e gs_shrink_to_letter.pdf ]
  then
    echo "Tempfile gs_shrink_to_letter.pdf exists. Exiting."
    exit 1
  fi
  if gswin32c -dQUIET -dNOPAUSE -dBATCH -sPAPERSIZE=letter -sDEVICE=pdfwrite -sOutputFile=gs_shrink_to_letter.pdf -dPDFFitPage "$x"
  then
    echo "Success."
    mv gs_shrink_to_letter.pdf "$x"
  else
    echo "Error occurred, exiting."
    exit 1
  fi
done

After converting your PDFs as above, you can then apply Acrobat batch OCR without a hitch.
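To pick out which scans actually need the shrink step, a minimal sketch follows. It assumes the `pdfinfo` tool from poppler-utils (not mentioned in the post) and its `Page size: W x H pts` output line. The 45-inch limit from the error message is 3240 points at 72 points per inch. The function names are hypothetical.

```shell
# Hypothetical helper: succeed if either dimension (in points) exceeds 45 in = 3240 pt.
exceeds_acrobat_limit() {
  awk -v w="$1" -v h="$2" 'BEGIN { exit !(w > 3240 || h > 3240) }'
}

# Hypothetical helper: read pdfinfo output on stdin, print "WIDTH HEIGHT" in points.
page_size_pts() {
  awk '/^Page size:/ { print $3, $5; exit }'
}

# Print each PDF named on the command line whose reported page size is too large.
for f in "$@"
do
  read -r w h <<< "$(pdfinfo "$f" 2>/dev/null | page_size_pts)"
  if [ -n "$w" ] && exceeds_acrobat_limit "$w" "$h"
  then
    echo "$f"
  fi
done
```

The list it prints could then be fed to the batch script above, leaving already-small PDFs untouched.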

ISO Kids Game

Dear Lazyweb:

I’m looking for well-designed computer games that meet the following criteria:

(1) Appropriate for a bright four-year-old with low vision (but able to read large print)
(2) minimal/no advertising
(3) preferably Flash/web-based
(4) some educational value (math, reading, etc.)

A few Google searches haven’t turned up much that looks promising. Any suggestions?

China Masks

Among many of Jonah’s recent striking photographs from China and elsewhere in Asia, this series on Swine Flu masks is particularly eye-grabbing:

Jonah's Swine Flu Photographs