Windows curly quotes, accented characters on Linux Samba Shares and Cygwin XTerm: How to get Windows-1252 (AKA CP1252) from Linux

Before I forget: I have a bunch of files I mirror between Windows/NTFS and Linux/ext4 filesystems that include not only accented characters but also curly quotes in the filenames. (I know: the easiest solution would be to just get rid of the extended characters.) The curly quotes were created in Windows and are stored as Windows-1252 bytes, so they don’t render properly under the standard Linux character sets (UTF-8, iso8859-1, iso8859-15, etc.).

This all came up because iTunes under Windows couldn’t find curly-quote files when it was reading from the exported Samba share filesystem rather than an attached NTFS drive. The files showed up as missing because they had different filenames.

The solution was not easily google-able, so for the record, in brief, add this to the [Global] section of /etc/samba/smb.conf:

unix charset = cp1252
display charset = cp1252

And reload Samba.

Also, to make the characters render properly from a terminal on the Linux box, first create the relevant character set:

sudo localedef -f CP1252 -i en_US en_US.CP1252

Now you can use this charset on your Linux box, and, like magic, the curly characters will be back:

export LC_ALL='en_US.CP1252'
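To see why the names look different on each side, here is a minimal Python sketch (not part of the fix itself; the filename is made up for illustration). It compares the bytes a curly apostrophe becomes in CP1252 and in UTF-8, and shows what happens when CP1252 bytes are read under the other charsets:

```python
# Hypothetical filename containing a right single curly quote (U+2019).
name = "Don\u2019t Stop.mp3"

cp1252_bytes = name.encode("cp1252")  # the quote becomes the single byte 0x92
utf8_bytes = name.encode("utf-8")     # the quote becomes three bytes: 0xE2 0x80 0x99

print(cp1252_bytes)
print(utf8_bytes)

# The CP1252 bytes are not valid UTF-8, so a UTF-8 locale cannot display them.
try:
    cp1252_bytes.decode("utf-8")
except UnicodeDecodeError as exc:
    print("not valid UTF-8:", exc.reason)

# ISO 8859-1 accepts the byte but maps 0x92 to a control character, not a quote.
print(repr(cp1252_bytes.decode("latin-1")))
```

The same filename therefore has two different byte sequences, which would be consistent with a client treating the files as missing.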

Free Tip: How to resize scanned PDFs with ghostscript for Adobe Acrobat OCR

I’m unaware of any free tool to perform OCR on a PDF and embed the resulting data in the PDF itself so it is text-searchable. If anyone knows of one, let me know. In the meantime, I use Acrobat Professional for this essential functionality.

High resolution PDFs produced by my scanner (HP Officejet Pro L7700) usually give the following error when I try to perform Acrobat OCR:

This page is larger than the maximum page size of 45 inches by 45 inches.

Surprisingly, there doesn’t seem to be any way to change the page size of a PDF within Acrobat. It’s possible to print to a new PDF of the correct size, but this operation cannot easily be batched. If I apply the “crop” tool to resize the page in Acrobat, I get this error:

Page size may not be reduced.

Many report these issues in Adobe’s forums. The most common responses suggest reconfiguring the scanner or buying a new one.

After some googling, I found no quick and easy ghostscript recipe for the batch pre-processing needed to let Acrobat do the OCR. It’s not hard to do; finding the right switches is just a bit of a trial-and-error pain.

For posterity, then, here is a simple command line to make this happen (shown under Windows, but easily adapted to any other platform). First, download the latest ghostscript for your platform (at this time, 8.64 for Windows). Then:

gswin32c -dQUIET -dNOPAUSE -dBATCH -sPAPERSIZE=letter -sDEVICE=pdfwrite -sOutputFile=OUTPUT.pdf -dPDFFitPage INPUT.pdf

And a simple inelegant script to batch process (again, under Windows/cygwin, but easily adaptable). Feel free to make more elegant:

#!/bin/bash
# Shrink each PDF named on the command line to letter size, in place.
tmp=gs_shrink_to_letter.pdf
for x in "$@"
do
    echo -n "Processing $x ... "
    if [ ! -e "$x" ]
    then
        echo "File $x missing. Exiting."
        exit 1
    fi
    # Refuse to clobber a temp file left over from an earlier run.
    if [ -e "$tmp" ]
    then
        echo "Tempfile $tmp exists. Exiting."
        exit 1
    fi
    if gswin32c -dQUIET -dNOPAUSE -dBATCH -sPAPERSIZE=letter -sDEVICE=pdfwrite -sOutputFile="$tmp" -dPDFFitPage "$x"
    then
        echo Success.
        mv "$tmp" "$x"
    else
        status=$?   # save ghostscript's exit code before echo resets $?
        echo "Error occurred, exiting."
        exit "$status"
    fi
done

 
After converting your PDFs as above, you can then apply Acrobat batch OCR without a hitch.
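For anyone who would rather not depend on a fixed temp filename, here is a rough Python sketch of the same batch step. It assumes gswin32c is on the PATH; the function names build_gs_command and shrink_in_place are made up for this example, and only the ghostscript switches come from the command above.

```python
import os
import subprocess
import sys
import tempfile

def build_gs_command(src, dst, gs="gswin32c"):
    """Return the argument list for the ghostscript call shown above."""
    return [gs, "-dQUIET", "-dNOPAUSE", "-dBATCH", "-sPAPERSIZE=letter",
            "-sDEVICE=pdfwrite", f"-sOutputFile={dst}", "-dPDFFitPage", src]

def shrink_in_place(src, gs="gswin32c"):
    """Rewrite src at letter size; leave it untouched if ghostscript fails."""
    folder = os.path.dirname(os.path.abspath(src))
    # A unique temp file next to the source, so the final rename stays on one filesystem.
    fd, tmp = tempfile.mkstemp(suffix=".pdf", dir=folder)
    os.close(fd)  # ghostscript opens the path itself
    try:
        subprocess.run(build_gs_command(src, tmp, gs), check=True)
        os.replace(tmp, src)
    finally:
        # Clean up the temp file if the conversion did not complete.
        if os.path.exists(tmp):
            os.remove(tmp)

if __name__ == "__main__":
    for path in sys.argv[1:]:
        print(f"Processing {path} ...", end=" ", flush=True)
        try:
            shrink_in_place(path)
        except (OSError, subprocess.CalledProcessError) as exc:
            sys.exit(f"failed: {exc}")
        print("done")
```

On another platform, passing gs="gs" (or whatever the local ghostscript binary is called) should be the only change needed.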

WordPress “Pages” No Longer Work

All the “pages” linked from my weblog (for example, my “about” page and my PGP key) are broken. I’ve posted in the WordPress Support Forums with no luck. I’m not sure when or why they stopped working, but if any readers have suggestions on how to troubleshoot, I’d love to hear them. Nothing relevant appears in the server logs.

In the meantime, apologies if you came here trying to find out about me. I’m temporarily out of service.

Proof of Fall 2007

I’ve been playing around with Gallery and wpg2. I’m still a bit puzzled attempting to integrate Gallery and WordPress. I’ve resolved most issues; the main remaining issue is to display images in the Ajaxian theme without running over the borders in the Ajax/slideshow views. Also, the embedded image apparently doesn’t render in the RSS feed. Update: I’ve given up on the G2 tinymce plugin and the WPG2 tag for now and just hardcoded the image and album URL. Update 2: now the embedded image is working again for no good reason. Suggestions on the entire configuration are welcome.

In any case, I took some pretty photos today in our back yard (use left and right arrow keys to scroll through images after clicking on the one below — I still can’t get the navigation icons to appear):

[Gallery image: item 27, width 400]

Before: Proof of Spring 2007.

[Tags]Autumn, Foliage, Trees, WordPress, WPG2, Gallery[/Tags]

Search Keys For Google Patents

The Search Keys extension for Firefox is perhaps my favorite plugin. I tweaked it so it will work with Google Patent Search as well. The only additional code needed is in the searchkeys.js file:

{
    // Match Google Patent Search result pages.
    name: "Google Patents",
    test: function (uri) { return uri.host.indexOf("google") != -1 && uri.path.substr(0, 9) == "/patents?"; },
    // Treat links with the class "big" as search results.
    testLink: function (linkNode) { return (linkNode.className == "big"); }
},

I added a new minor version number and posted it here for download. Hopefully the patch will be adopted upstream.

Sudoku SQL Solution

In case this blog hasn’t been meeting its geekiness quota, I present the following. Ken (who deserves to be better known in the blogosphere and elsewhere) wrote a Haskell program to solve Sudoku with a single SQL query:

SELECT * FROM values AS aaaa, values AS aaab, values AS aaba, values AS aabb, values AS aabc, values AS aaca, values AS abab, values AS abba, values AS abbb, values AS abbc, values AS abcb, values AS abcc, values AS acba, values AS acbc, values AS acca, values AS accc, values AS baaa, values AS baac, values AS babb, values AS babc, values AS bacb, values AS bacc, values AS bbba, values AS bbbc, values AS bcaa, values AS bcab, values AS bcba, values AS bcbb, values AS bcca, values AS bccc, values AS caaa, values AS caac, values AS caba, values AS cabc, values AS cbaa, values AS cbab, values AS cbba, values AS cbbb, values AS cbbc, values AS cbcb, values AS ccac, values AS ccba, values AS ccbb, values AS ccbc, values AS cccb, values AS cccc WHERE aaaa.v <> 3 AND aaaa.v <> 9 AND aaaa.v <> baaa.v AND aaaa.v <> 4 AND aaaa.v <> bcaa.v AND aaaa.v <> caaa.v AND aaaa.v <> cbaa.v AND aaaa.v <> 8 AND 3 <> 9 AND 3 <> baaa.v AND 3 <> 4 AND 3 <> bcaa.v AND 3 …

(edited for brevity)
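For the curious, the shape of the query can be reproduced in a few lines. What follows is a rough Python sketch, not Ken's Haskell program: it uses sqlite3, a table named digits (values is a reserved word in SQLite), made-up cell aliases like c01, and a small 4x4 puzzle so the cross join stays cheap. Every empty cell becomes one alias of the digit table, and every pair of cells sharing a row, column or box gets a <> condition.

```python
import sqlite3
from itertools import combinations

def units(n):
    """Yield each row, column and box of an (n*n) x (n*n) grid as (row, col) lists."""
    size = n * n
    for i in range(size):
        yield [(i, c) for c in range(size)]  # row i
        yield [(r, i) for r in range(size)]  # column i
    for br in range(0, size, n):
        for bc in range(0, size, n):
            yield [(br + r, bc + c) for r in range(n) for c in range(n)]

def build_query(grid, n):
    """Build one SELECT whose rows are the solutions; 0 marks an empty cell."""
    empty = [(r, c) for r, row in enumerate(grid) for c, v in enumerate(row) if v == 0]
    alias = {cell: f"c{cell[0]}{cell[1]}" for cell in empty}

    def term(cell):
        # Empty cells are columns of the joined table, givens are literals.
        return f"{alias[cell]}.v" if cell in alias else str(grid[cell[0]][cell[1]])

    conditions = set()
    for unit in units(n):
        for a, b in combinations(unit, 2):
            if a in alias or b in alias:  # skip comparisons between two givens
                conditions.add(f"{term(a)} <> {term(b)}")
    select = ", ".join(f"{alias[c]}.v" for c in empty)
    tables = ", ".join(f"digits AS {alias[c]}" for c in empty)
    where = " AND ".join(sorted(conditions))
    return f"SELECT {select} FROM {tables} WHERE {where}", empty

def solve(grid, n):
    """Run the generated query in an in-memory database and fill in the grid."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE digits (v INTEGER)")
    db.executemany("INSERT INTO digits VALUES (?)", [(d,) for d in range(1, n * n + 1)])
    sql, empty = build_query(grid, n)
    row = db.execute(sql).fetchone()
    if row is None:
        raise ValueError("puzzle has no solution")
    solved = [line[:] for line in grid]
    for (r, c), v in zip(empty, row):
        solved[r][c] = v
    return solved

puzzle = [[1, 0, 0, 4],
          [0, 4, 1, 0],
          [2, 0, 0, 3],
          [0, 3, 2, 0]]
for line in solve(puzzle, 2):
    print(line)
```

Unlike the query above, this sketch leaves out the constant comparisons between two given digits (such as 3 <> 9), since they cannot affect the result. It also assumes the puzzle has at least one empty cell.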

WordPress Upgrade -> 2.2

Just upgraded to WordPress 2.2. That took all of 18 seconds.

Grimmelmann on PrawfsBlawg

Not to be missed: well-known enfant terrible James Grimmelmann is guest-blogging on PrawfsBlawg. His opening commentary on the relationship between law practice and computer science:

Practicing lawyers, like practicing programmers, are professional pragmatists. Both must make their cases (and case mods) out of the materials they have available; both starve or eat steak depending on whether their creations work. The day-to-day practice of law is unlikely ever to require much high theory. We can mourn that fact because it means that they look at us with suspicion, or celebrate it because it frees us to chase Truth and Beauty—and it will remain a fact either way.

Aside from the fact that I don’t eat steak, I think this is correct.

Via a commenter on James’ entry, I learned that the 7th Circuit Court of Appeals is implementing a wiki (the entry page could surely use some more content). Surprisingly, it was not Posner but Easterbrook who spearheaded the effort. This is a very interesting development, but I expect it will be quite a while before any other circuit takes up the idea.

Finally, I have been meaning to write about this New York Times story describing Jonathan Coulton’s success as a musician breaking with the traditional distribution/promotional channels (via 43 Folders, a productivity blog that is still on my “probation” list). Unfortunately, Slashdot beat me to it. I first re-discovered Jonathan Coulton during his guest episode of the Show with Ze Frank. In any event, the article is well worth reading:

More than 3,000 people, on average, were visiting his site every day, and his most popular songs were being downloaded as many as 500,000 times; he was making what he described as “a reasonable middle-class living” — between $3,000 and $5,000 a month — by selling CDs and digital downloads of his work on iTunes and on his own site…

Coulton realized he could simply poll his existing online audience members, find out where they lived and stage a tactical strike on any town with more than 100 fans, the point at which he’d be likely to make $1,000 for a concert. It is a flash-mob approach to touring: he parachutes into out-of-the-way towns like Ardmore, Pa., where he recently played to a sold-out club of 140….

In total, 41 percent of Coulton’s income is from digital-music sales, three-quarters of which are sold directly off his own Web site. Another 29 percent of his income is from CD sales; 18 percent is from ticket sales for his live shows. The final 11 percent comes from T-shirts, often bought online…

randomplay 0.60 released

Today I released version 0.60 of randomplay, my command-line shuffle-recall-swiss-army-knife music player. It will never make Winamp users happy, but it’s a good substitute for the complex combinations of find/grep/xargs/sort that people sometimes use to pick tracks to play. If you can’t see why you’d use it, you probably don’t need it.

The latest version adds two new command-line options, --older-than and --newer-than. These can be used to limit the songs included in the shuffle on the basis of the file modification date. The syntax is fairly flexible, and resembles that used by rdiff-backup for restoration commands. For example:

Randomly play music under the ~/music directory that was added in the past week:

randomplay --newer-than 1W ~/music

Play in order music that is from before this year:

randomplay --norandom --older-than '1/1/2006'

List the filenames of music that was added in the past six months but hasn’t been played in the last three months:

randomplay --names-only --newer-than 6M --days 3M

Play, but don’t record in the playing history, music added in the first three months of 2004:

randomplay --noremember --newer-than '1/1/2004' --older-than '4/1/2004'

Unfortunately, this new feature is pretty slow, because it stats each file individually on the initial spidering of the directories to be played. In fact, startup is always fairly slow if you are searching a large directory hierarchy, since randomplay does not preserve any file index but checks anew on each execution. If you are searching tens of thousands of tracks over NFS (as I do), this can take a minute or so. Suggestions for improving the performance of the file modification time detection or of the whole startup are welcome. At some point, I will probably implement an indexing feature, but for now I like the simplicity of having it work basically like the shell find command.
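To make the per-file cost concrete, here is a small Python sketch of what a --newer-than filter boils down to. It is an illustration only, not randomplay's actual code: the names cutoff and newer_than are invented, only the D/W/M/Y interval units are handled, and a month is assumed to be 30 days.

```python
import os
import time

# Assumed interval lengths in seconds; a month is taken as 30 days, a year as 365.
SECONDS = {"D": 86400, "W": 7 * 86400, "M": 30 * 86400, "Y": 365 * 86400}

def cutoff(spec, now=None):
    """Turn an interval such as '1W' or '6M' into an absolute epoch time."""
    now = time.time() if now is None else now
    count, unit = int(spec[:-1]), spec[-1]
    return now - count * SECONDS[unit]

def newer_than(root, spec, now=None):
    """Yield files under root whose modification time is after the cutoff."""
    limit = cutoff(spec, now)
    stack = [root]
    while stack:
        with os.scandir(stack.pop()) as entries:
            for entry in entries:
                if entry.is_dir(follow_symlinks=False):
                    stack.append(entry.path)
                # One stat per file: this is the slow part on a large tree.
                elif entry.is_file() and entry.stat().st_mtime > limit:
                    yield entry.path
```

Calling list(newer_than(os.path.expanduser("~/music"), "1W")) would collect the tracks changed in the past week. The walk still has to look at every file, so an index that remembers modification times between runs is the obvious next step.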

Version 0.60 should show up in Debian unstable shortly.

Yahoo Qmail Daemon and Mailman

My server’s mailman (or postfix) installation is mysteriously rejecting mail from one Yahoo! mail user. I don’t get it:

Hi. This is the qmail-send program at yahoo.com.
I'm afraid I wasn't able to deliver your message to the following addresses.
This is a permanent error; I've given up. Sorry it didn't work out.
:
72.1.169.10 does not like recipient.
Remote host said: 550 : Recipient address rejected: undeliverable address: unknown user: "[list name]"
Giving up on 72.1.169.10.

72.1.169.10 is, in fact, the IP address of my server. [list name] is (in the real version) the real live name of the list. The list seems to work for everyone else. And it’s certainly not true that my server or I don’t like this recipient (or sender).

Aside from this anomalous behavior, it’s also funny that Yahoo! provides plain old unfiltered qmail bounce messages to its users. Wouldn’t you think a fully matured webmail service like Yahoo! would, by this point, have somewhat customized its mail error reporting messages? In fact, wouldn’t you think they would want to hide the fact that they use qmail at all, if only for security purposes? Couldn’t they hire an intern to write a few replacement error messages? Maybe I’m missing something.

A propos, I discovered this nice piece from McSweeney’s, entitled YAHOO’S MAILER-DAEMON AUTOMATED REPLY FOR FAILED E-MAIL DELIVERY IS GETTING A LITTLE TOO INTIMATE.

Update 5/30/06: Figured it out. Oddly, Yahoo! was looking up the CNAME DNS record for the domain name and substituting its target in the mail header. While the original email went to, e.g., listname@lists.mydomain.com, the message as delivered was addressed to listname@servername.mydomain.com. Because only lists.mydomain.com processed email for lists, the message bounced. The solution was to change lists.mydomain.com from a CNAME entry to its own A entry with the IP address specified directly. That fixed the problem. I’ve never seen any other mail service work this way; Gmail certainly doesn’t.
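The failure can be modelled in a few lines of Python. This is a toy simulation with hypothetical data, using the placeholder domain names from the update; it does no real DNS lookups and is only my reading of what the sending side appeared to do.

```python
# Hypothetical DNS data: lists.mydomain.com is an alias for the server's real name.
CNAME = {"lists.mydomain.com": "servername.mydomain.com"}

# Assumption: the server hands mail to the list manager only for this domain.
LIST_DOMAINS = {"lists.mydomain.com"}

def canonicalize(address, cname=CNAME):
    """Rewrite the domain part to its CNAME target, if it has one."""
    local, domain = address.rsplit("@", 1)
    return f"{local}@{cname.get(domain, domain)}"

def accepts(address):
    """True if the receiving server recognises the address as a list address."""
    return address.rsplit("@", 1)[1] in LIST_DOMAINS

sent = "listname@lists.mydomain.com"

# With the CNAME in place, the rewritten recipient is unknown to the server.
delivered = canonicalize(sent)
print(delivered, accepts(delivered))

# With an A record instead (no CNAME to follow), the address is left alone.
fixed = canonicalize(sent, cname={})
print(fixed, accepts(fixed))
```

Replacing the CNAME with an A record removes the alias the sender was following, so the recipient domain stays lists.mydomain.com and the list manager sees the message.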