Windows curly quotes, accented characters on Linux Samba Shares and Cygwin XTerm: How to get Windows-1252 (AKA CP1252) from Linux

Before I forget: I mirror a bunch of files between Windows/NTFS and Linux/ext4 filesystems, and their filenames include not only accented characters but also curly quotes. (I know: the easiest solution would be to just get rid of the extended characters.) The curly quotes were created in Windows, so they don’t render properly in the standard Linux character sets (UTF-8, ISO-8859-1, ISO-8859-15, etc.).

This all came up because iTunes under Windows couldn’t find curly-quote files when it was reading from the exported Samba share filesystem rather than an attached NTFS drive. The files showed up as missing because they had different filenames.

The solution was not easily google-able, so for the record, in brief, add this to the [Global] section of /etc/samba/smb.conf:

unix charset = cp1252
display charset = cp1252

And reload Samba.

Also, to make the characters render properly in a terminal on the Linux box, first create a locale that uses the relevant character set:

sudo localedef -f CP1252 -i en_US en_US.CP1252

Now you can use this locale on your Linux box, and, like magic, the curly characters will be back:

export LC_ALL='en_US.cp1252'
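
To see why the same filename looks different on the two systems, it helps to look at the raw bytes. This is only a minimal sketch, assuming iconv and od are installed and that your iconv recognizes the name CP1252; the helper name cp1252_to_utf8_hex is made up for illustration and is not part of any tool:

```shell
#!/bin/sh
# Hypothetical helper: take one CP1252 byte (given in octal) and print
# the hex bytes of the same character once converted to UTF-8.
cp1252_to_utf8_hex() {
    printf "\\$1" | iconv -f CP1252 -t UTF-8 | od -An -tx1 | tr -d ' \n'
}

# Octal 223 is byte 0x93, the left curly double quote in Windows-1252.
cp1252_to_utf8_hex 223; echo    # prints e2809c
# Octal 351 is byte 0xE9, e with an acute accent.
cp1252_to_utf8_hex 351; echo    # prints c3a9
```

A single CP1252 byte becomes two or three bytes in UTF-8, so a filename written under one encoding and read under the other will not match.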

Free Tip: How to resize scanned PDFs with Ghostscript for Adobe Acrobat OCR

I’m unaware of any free tool to perform OCR on a PDF and embed the resulting data in the PDF itself so it is text-searchable. If anyone knows of one, let me know. In the meantime, I use Acrobat Professional for this essential functionality.

High resolution PDFs produced by my scanner (HP Officejet Pro L7700) usually give the following error when I try to perform Acrobat OCR:

This page is larger than the maximum page size of 45 inches by 45 inches.

Surprisingly, there doesn’t seem to be any way to change the page size of a PDF within Acrobat. It’s possible to print to a new PDF of the correct size, but this operation cannot easily be batched. If I apply the “crop” tool to resize the page in Acrobat, I get this error:

Page size may not be reduced.

Many report these issues in Adobe’s forums. The most common responses suggest reconfiguring the scanner or buying a new one.

After some googling, I found no quick and easy Ghostscript recipe for the batch pre-processing that Acrobat needs before it can do the OCR. It’s not hard to do; finding the right switches just takes some painful trial and error.

For posterity, then, here is a simple command line to make this happen (shown here under Windows, but easily adapted for any other platform). First, download the latest Ghostscript for your platform (at the time of writing, 8.64 for Windows). Then:

gswin32c -dQUIET -dNOPAUSE -dBATCH -sPAPERSIZE=letter -sDEVICE=pdfwrite -sOutputFile=OUTPUT.pdf -dPDFFitPage INPUT.pdf

And a simple inelegant script to batch process (again, under Windows/cygwin, but easily adaptable). Feel free to make more elegant:

#!/bin/bash
# Shrink each PDF named on the command line to letter size, in place.
tmp=gs_shrink_to_letter.pdf
for x in "$@"
do
    echo -n "Processing $x ... "
    if [ ! -e "$x" ]
    then
        echo "File $x missing. Exiting."
        exit 1
    fi
    if [ -e "$tmp" ]
    then
        echo "Tempfile $tmp exists. Exiting."
        exit 1
    fi
    # Write to a temp file first so a failed run never clobbers the original.
    gswin32c -dQUIET -dNOPAUSE -dBATCH -sPAPERSIZE=letter -sDEVICE=pdfwrite -sOutputFile="$tmp" -dPDFFitPage "$x"
    # Save the status now; after the echo below, $? would always be 0.
    status=$?
    if [ "$status" -eq 0 ]
    then
        echo "Success."
        mv "$tmp" "$x"
    else
        echo "Error occurred, exiting."
        exit "$status"
    fi
done

 
After converting your PDFs as above, you can then apply Acrobat batch OCR without a hitch.
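
If you want to know which pages are over the limit before converting, the arithmetic is simple: PDF page sizes are measured in points, 72 to the inch by default, so 45 inches is 3240 points. Here is a rough sketch, assuming you already have a page’s width and height in points from some other tool, and assuming that exceeding the limit on either side is enough to trigger the error. The function name page_too_large is made up for illustration:

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit 0) if a page of WIDTH x HEIGHT points
# is larger than 45 inches on either side, fail (exit 1) otherwise.
page_too_large() {
    awk -v w="$1" -v h="$2" 'BEGIN {
        limit = 45 * 72    # 45 inches at 72 points per inch = 3240 points
        exit !(w > limit || h > limit)
    }'
}

# US letter is 612 x 792 points, well under the limit.
if page_too_large 612 792; then echo "too large"; else echo "ok"; fi
# Made-up example: a 5100 x 6600 point page is over the limit.
if page_too_large 5100 6600; then echo "too large"; else echo "ok"; fi
```

Letter size comes out at 8.5 x 11 inches, which is why shrinking to letter in the Ghostscript command above gets the page under Acrobat’s limit.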

ISO Kids Game

Dear Lazyweb:

I’m looking for well-designed computer games that meet the following criteria:

(1) Appropriate for a bright four-year-old with low vision (but able to read large print)
(2) minimal/no advertising
(3) preferably Flash/web-based
(4) some educational value (math, reading, etc.)

A few Google searches haven’t turned up much that’s promising. Any suggestions?

China Masks

Among many of Jonah’s recent striking photographs from China and elsewhere in Asia, this series on Swine Flu masks is particularly eye-grabbing:

Jonah's Swine Flu Photographs

Algorithmic Glitch

How did Facebook come up with this?

Comcast Upgrade

Not bad!

Dismantle Storrow Drive

Most brilliant idea so far this year. In brief, Storrow Drive was never supposed to exist. Now it needs massive repairs, which will be both expensive and disruptive. Rather than fix it, some are calling to simply remove it, similar to when San Francisco decided to tear down its elevated highway after it was severely damaged in an earthquake. Proponents of dismantling Storrow Drive include former Secretary of Transportation Fred Salvucci and former DPW associate commissioner Ken Kruckemeyer. Not your typical wild-eyed anti-car fanatics.

The naysayers (and this Radio Boston episode shows there are many) fail to understand induced demand. Many share the naive belief that if a highway is removed, all the traffic it once carried will be redistributed to other roads, further increasing congestion. But it’s not a zero-sum game. Numerous examples show that tearing down a highway can actually relieve traffic, not to mention bring enormous aesthetic and environmental benefits. Road networks are dynamic systems: change one parameter and the rest will readjust as well. Gas prices, tolls, road congestion and capacity, suburban and urban property taxes, MBTA fares and service levels, and regional land use and transportation planning policies all feed into each other. One caller to the Radio Boston show claimed she needs to use Storrow Drive daily because there isn’t enough parking at Alewife! (I don’t think I need to spell it out, but we can safely assume that improving the Alewife parking garage would be a good bit cheaper than rebuilding Storrow, without needing to look up the precise numbers. Plus it would reduce congestion, improve air quality, and increase T ridership.)

Of course, it would be equally naive to assume any road can be removed without consequences for congestion, but dismantling Storrow Drive seems like a perfect start for the post-carbon era.

Political Politicians

Kudos to Martha Coakley for challenging the Federal Defense of Marriage Act. I wonder how Justice responds when they’re asked to defend a statute that the Administration has said should be repealed. Perhaps with a tepid defense.

What I don’t understand is those who criticize Coakley by claiming that her motivations are “purely political”. What’s up with that? Aren’t politicians supposed to act politically?

Of course, we do want our elected officials to have some backbone, particularly to resist popular outbursts that might have bad policy consequences. (Obama’s effective neutralization of the “Buy America” stimulus bill provision is a good example). But here we have a politician taking a strong stand for the rights of a long-disenfranchised minority group; if her motivations are “purely political,” then let’s elect more politically-motivated politicians.

MBTA Blocking TPM

I’ve been happy to see WiFi appearing on nearly every MBTA commuter rail car recently. I was less happy to see this:

No TPM on MBTA

I guess I’ll have to wait until I get home to find out why this bothered Steve so much.

Oddly, the MBTA’s web filter also blocked access to my WordPress editor, but unlike the TPM block, I could select “yes, I really want to do this” to get here.

I’ve never understood why web filters so often block these sorts of sites on apparently generic settings. “General News/Blogs/Wikis” are dangerous? Reputation “neutral”? I’d be surprised if anyone at the T actually did this on purpose, but I suppose it would fit the general pattern of operational incompetence.

Update: the problem appears to be real.

Jonah’s New Photo Blog

I’ve got to give a shout-out to my brother Jonah’s new photo blog, so you can finally keep up with his exploits via RSS. His recent work from Algeria is amazing:

Jonah’s New Blog