I started this blog over a decade ago. Over time, my priorities have changed—family, work, home, etc. Many other avenues for online self-expression have also developed in the interim. I’m done for now. Old entries remain online, but do not necessarily reflect any of my current views, and certainly do not reflect any views of any of my employers.
Another LazyWeb request: any suggestions for how to search (on any platform) for PDFs that have not been OCR’d?
I have a bash script with a while loop that takes a long time to process. It restores file modification times for complicated reasons not worth discussing here. Removing some nonessential stuff, I have the following code (I know it could be rewritten to be elegant, or at least collapsed into a single line):
    cat ./preserve_file_mod_times | while read x
    do
        filename=`echo $x|sed 's/|.*//g'`
        lastmod=`echo $x|sed 's/^.*|//g'`
        touch -t "$lastmod" "$filename"
    done
As the input file has gotten longer, the loop now frequently fails:
376 Profiling timer expired touch -t "$lastmod" "$filename"
I’ve googled this error and understand what it is (i.e. SIGPROF) but not how to fix it or work around it. Any hints?
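One direction that might help, offered as a sketch rather than a confirmed fix: let `read` split each line on `|` itself, so the loop no longer forks two `sed` processes per entry. This assumes every line holds exactly one `|` between the filename and a timestamp already in `touch -t` format; the function name `restore_mod_times` is made up for illustration. Whether it makes the SIGPROF go away is untested, but it does far less work per line.

```shell
#!/bin/bash
# Sketch: restore modification times from a "filename|timestamp" list.
# Assumes one '|' per line and timestamps already in touch -t format.
restore_mod_times() {
    local list="$1" filename lastmod
    # IFS='|' makes read split the line itself; -r keeps backslashes literal.
    while IFS='|' read -r filename lastmod
    do
        [ -n "$filename" ] || continue    # skip blank lines
        touch -t "$lastmod" "$filename"
    done < "$list"
}
```

It would be called as `restore_mod_times ./preserve_file_mod_times`. Reading the file with a redirect instead of `cat` also saves one more process.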
Can you suggest open source content management software that meets these criteria:
- Very easy to install and configure (from the admin side) and use (from the user side) — I’m thinking as easy from both sides as vqwiki
- Drag-and-drop to upload content — ideally, the user could drag a DOC file from desktop into a widget in the browser to upload
- Quick searching and indexing, at least of common file-types (including DOC and PDF)
- Ability to set up arbitrary metadata elements and values that can assist as filters to searching
Generally, content will be located by search, rather than via any particular folder hierarchy.
Does it exist? There seem to be a lot of open source CMS options, but at first glance they look like overkill, with a significant learning curve at least on the admin side.
My most successful Green Lemonade to date:
- Five or six leaves of kale
- A bunch of fresh parsley
- One-third of a medium-sized cucumber
- A thick inch of fresh ginger (peeled)
- A whole lemon (include seeds, peel, etc.)
- A whole lime (likewise)
- Two fuji apples (cored)
Juice and consume.
I have heard from green-juice skeptics before. It may be hopeless for some poor souls. But don’t knock it until you’ve tried it.
Windows curly quotes, accented characters on Linux Samba Shares and Cygwin XTerm: How to get Windows-1252 (AKA CP1252) from Linux
Before I forget: I mirror a bunch of files between Windows/NTFS and Linux/ext4 filesystems, and their filenames include not only accented characters but also curly quotes. (I know: the easiest solution would be to just get rid of the extended characters.) The curly quotes were created in Windows, so they don’t render properly under the standard Linux character sets (UTF-8, ISO-8859-1, ISO-8859-15, etc.).
This all came up because iTunes under Windows couldn’t find curly-quote files when it was reading from the exported Samba share filesystem rather than an attached NTFS drive. The files showed up as missing because they had different filenames.
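A small sketch of why the names stop matching, assuming `iconv` with CP1252 support is installed: the same left curly quote is a single byte in Windows-1252 but three bytes in UTF-8, so a filename written under one encoding does not compare equal under the other. The helper name `hex_bytes` is made up for this example.

```shell
#!/bin/bash
# Print the bytes read from stdin as one run of hex digits.
hex_bytes() {
    od -An -tx1 | tr -d ' \n'
    echo
}

# Left curly quote as stored by Windows-1252: the single byte 0x93 (octal 223).
printf '\223' | hex_bytes                               # prints 93
# The same character after conversion to UTF-8: three bytes.
printf '\223' | iconv -f CP1252 -t UTF-8 | hex_bytes    # prints e2809c
```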
The solution was not easily google-able, so for the record, in brief, add this to the [Global] section of /etc/samba/smb.conf:
    unix charset = cp1252
    display charset = cp1252
And reload Samba.
Also, to make the characters render properly from a terminal on the Linux box, first create the relevant character set:
sudo localedef -f CP1252 -i en_US en_US.CP1252
Now you can use this charset on your Linux box, and, like magic, the curly characters will be back:
I’m unaware of any free tool to perform OCR on a PDF and embed the resulting data in the PDF itself so it is text-searchable. If anyone knows of one, let me know. In the meantime, I use Acrobat Professional for this essential functionality.
High-resolution PDFs produced by my scanner (HP Officejet Pro L7700) usually give the following error when I try to perform Acrobat OCR:
This page is larger than the maximum page size of 45 inches by 45 inches.
Surprisingly, there doesn’t seem to be any way to change the page size of a PDF within Acrobat. It’s possible to print to a new PDF of the correct size, but this operation cannot easily be batched. If I apply the “crop” tool to resize the page in Acrobat, I get this error:
Page size may not be reduced.
Many report these issues in Adobe’s forums. The most common responses suggest reconfiguring the scanner or buying a new one.
Some googling for a simple Ghostscript recipe to do the batch pre-processing that lets Acrobat perform the OCR turned up nothing quick and easy. It’s not hard to do, just a bit of a trial-and-error pain to find the right switches.
For posterity, then, here is a simple command line to make this happen (here under Windows, but it could easily be adapted for any other platform). First, download the latest Ghostscript for your platform (at this time, 8.64 for Windows). Then:
gswin32c -dQUIET -dNOPAUSE -dBATCH -sPAPERSIZE=letter -sDEVICE=pdfwrite -sOutputFile=OUTPUT.pdf -dPDFFitPage INPUT.pdf
And a simple, inelegant script to batch process (again, under Windows/Cygwin, but easily adaptable). Feel free to make it more elegant:
    #!/bin/bash
    # Shrink each PDF named on the command line to letter size, in place.
    for x in "$@"
    do
        echo -n "Processing $x ... "
        if [ ! -e "$x" ]
        then
            echo "File $x missing. Exiting."
            exit 1
        fi
        if [ -e gs_shrink_to_letter.pdf ]
        then
            echo "Tempfile gs_shrink_to_letter.pdf exists. Exiting."
            exit 1
        fi
        if gswin32c -dQUIET -dNOPAUSE -dBATCH -sPAPERSIZE=letter -sDEVICE=pdfwrite -sOutputFile=gs_shrink_to_letter.pdf -dPDFFitPage "$x"
        then
            echo Success.
            mv gs_shrink_to_letter.pdf "$x"
        else
            status=$?    # save Ghostscript's exit code before echo resets $?
            echo "Error occurred, exiting."
            exit "$status"
        fi
    done
After converting your PDFs as above, you can then apply Acrobat batch OCR without a hitch.
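For other platforms, the same idea can be sketched with a private temp file per input instead of one fixed name. This is an untested adaptation, not the script I actually ran: it assumes the Ghostscript binary is called `gs` (overridable through a `GS_BIN` variable invented for this sketch) and that `mktemp` is available. The function name `shrink_to_letter` is illustrative too.

```shell
#!/bin/bash
# Sketch: shrink each PDF argument to letter size, in place.
# GS_BIN is a hypothetical override; "gs" is the usual name outside Windows.
GS_BIN="${GS_BIN:-gs}"

shrink_to_letter() {
    local pdf tmp status
    for pdf in "$@"
    do
        if [ ! -e "$pdf" ]
        then
            echo "File $pdf missing. Exiting." >&2
            return 1
        fi
        # Temp file next to the input, so the final mv stays on one filesystem.
        tmp=$(mktemp "$(dirname "$pdf")/gs_shrink.XXXXXX") || return 1
        if "$GS_BIN" -dQUIET -dNOPAUSE -dBATCH -sPAPERSIZE=letter \
            -sDEVICE=pdfwrite -sOutputFile="$tmp" -dPDFFitPage "$pdf"
        then
            mv "$tmp" "$pdf"
        else
            status=$?    # Ghostscript's exit code, saved before anything resets it
            rm -f "$tmp"
            echo "Error occurred on $pdf, exiting." >&2
            return "$status"
        fi
    done
}
```

It would be called as `shrink_to_letter *.pdf`. One caveat: `mktemp` creates files readable only by their owner, so the converted PDFs may end up with tighter permissions than the originals.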
I’m looking for well-designed computer games that meet the following criteria:
(1) Appropriate for a bright four-year-old with low vision (but able to read large print)
(2) minimal/no advertising
(3) preferably Flash/web-based
(4) some educational value (math, reading, etc.)
A few Google searches haven’t turned up much that looks promising. Any suggestions?