Blog Spam Wars Escalate

For the past year or two, I’ve kept weblog spam comments at bay with a custom hack that blacklists common spam phrases and URLs. Every month or two, a new spam format seems to evade the filter, and it’s usually easy enough to identify a unique phrase that is unlikely to appear in a legitimate comment and add it to the blacklist. (Most of these unique phrases are not appropriate for general audience consumption—suffice it to say that they often relate to unorthodox sexual activities). If I receive more than three or four spam comments from a particular IP address, I blacklist that IP address from commenting.

This approach has been fairly low-maintenance and, while rather crude, produces very few false negatives or false positives.
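
For the curious, the hack amounts to little more than the following sketch (the phrases, addresses, and function name are illustrative placeholders, not my actual lists):

    # Minimal sketch of a phrase/IP blacklist comment filter.
    # The phrases, addresses, and names are placeholders, not the real lists.
    BLACKLISTED_PHRASES = ["some unique spam phrase", "another telltale phrase"]
    BLACKLISTED_IPS = {"192.0.2.1", "198.51.100.7"}

    def is_spam(comment_text, poster_ip):
        """Return True if the comment should be rejected outright."""
        if poster_ip in BLACKLISTED_IPS:
            return True
        text = comment_text.lower()
        return any(phrase in text for phrase in BLACKLISTED_PHRASES)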

Just recently, however, my friend Jamie reports that he is getting “spam” comments on his blog that actually don’t point to any spam sites. In fact, there’s nothing in the comment or the URLs that is spam, other than the fact that the comments have nothing to do with the blog entry.

This phenomenon is apparently becoming widespread—for example, see this poor guy, who appears to have a totally legitimate blog which is now the “target” of many of these quasi-spam comments.

My guess is that these faux spams are designed to trigger automatic blacklists and thus poison the blacklists with “good” sites and presumably ruin the whole system. It’s not really effective against my technique, which involves manually blacklisting sites, but it is certainly annoying. So far none of these have hit me, but I’m sure it won’t be long.

I’m loath to implement a captcha or login requirement on my blog—one of the great things about the blogosphere is the low barrier to entry for participation—but that may be the only choice. Any other ideas?

What’s wrong with 209.88.228.11 and/or Konqueror?

Today I received over 100,000 hits like this:

 209.88.228.11 - - [04/Oct/2005:16:57:52 -0400] "PROPFIND /error/notfound.html/ HTTP/1.1" 302 240 "-" "Mozilla/5.0 (compatible; Konqueror/3.4; Linux) KHTML/3.4.1 (like Gecko) (Debian package 4:3.4.1-1)" 

It looks like the person actually came to my site for a legitimate reason:

 209.88.228.11 - - [04/Oct/2005:09:50:41 -0400] "GET /weblog/2005/08/ HTTP/1.1" 200 53323 "http://www.google.com/search?hl=en&ie=UTF-8&q=download+growisofs+5.21+debian&spell=1" "Mozilla/5.0 (compatible; Konqueror/3.4; Linux) KHTML/3.4.1 (like Gecko) (Debian package 4:3.4.1-1)" 

and then wanted to see the contents of my /blogimages directory. That directory (where I store images that appear on this blog) cannot be publicly viewed:

 209.88.228.11 - - [04/Oct/2005:09:52:19 -0400] "PROPFIND /blogimages/ HTTP/1.1" 302 239 "-" "Mozilla/5.0 (compatible; Konqueror/3.4; Linux) KHTML/3.4.1 (like Gecko) (Debian package 4:3.4.1-1)"
 209.88.228.11 - - [04/Oct/2005:09:52:19 -0400] "PROPFIND /error/notfound.html/ HTTP/1.1" 302 239 "-" "Mozilla/5.0 (compatible; Konqueror/3.4; Linux) KHTML/3.4.1 (like Gecko) (Debian package 4:3.4.1-1)"

But why would this failed request repeat more than 100,000 times, basically every second for hours? Is this very bad Konqueror behavior, a well-camouflaged denial-of-service attack, or something else entirely? This kind of thing could generate some bad press for free software unless there’s a good explanation (“Konqueror security hole swamps innocent websites,” etc.).
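
For anyone who wants to check their own logs for this sort of thing, a quick tally of identical requests per client does the trick. This is a sketch assuming the standard combined log format shown above and a file named access.log:

    # Count how many times each (client IP, request line) pair appears in an
    # Apache combined-format access log; the top entries expose repeat floods.
    import re
    from collections import Counter

    LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "([^"]*)"')

    counts = Counter()
    with open("access.log") as log:
        for line in log:
            match = LOG_LINE.match(line)
            if match:
                counts[match.groups()] += 1

    for (ip, request), hits in counts.most_common(5):
        print(hits, ip, request)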

Google Maps Glitches

Is it just me, or does the Free Software Foundation show up as a result in a Google Maps ‘local search’ for Lowe’s Home Improvement?

Here’s a screenshot in case people think I’m losing my mind.

At least it’s the last result—I suppose it is the least relevant of all the choices.

Apparently, I’m not the only one who has noticed that Google has a predilection for the Free Software Foundation.

Normal Conversations

Having thrown out another obscure reference in email correspondence with Steve, one that was understood almost immediately, I pondered how it was possible to have normal conversations before Google. As it turns out, the answer is inherently unknowable. (At least for another few days).

Bad MSN

The MSNBot recently attempted to request a page on a website I run that is not publicly linked. In fact, no page from the domain name in question is publicly linked. The robots.txt file excludes all robots. My .htaccess file also blocks HTTP requests from known search engine robot IP address ranges (including MSNBot’s). Moreover, the page requested wasn’t the top level page (i.e., http://domain.com/), but some page buried therein (i.e., http://domain.com/some_dir/some_page.html).

The only request I have in my server logs from MSNBot is for this buried page—MSNBot (at least, anything identifying itself as MSNBot) never requested any of the pages that would be necessary to find it. The universe of people who have access to the website at this domain is very small; I can in fact identify every single IP address in the server log as someone I know.
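
Checking this is easy enough. Here is a sketch, using the placeholder path from above and an assumed access.log, that prints the client IP and user agent for every log line mentioning the buried page:

    # Print the client IP and user agent of every log line that mentions the
    # buried page.  The path is the placeholder from above, not the real one.
    BURIED_PATH = "/some_dir/some_page.html"

    with open("access.log") as log:
        for line in log:
            if BURIED_PATH in line:
                ip = line.split()[0]
                user_agent = line.rsplit('"', 2)[1]   # last quoted field
                print(ip, user_agent)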

I can only conceive of two hypotheses for this, both of which would be a bad sign for MSNBot:

  • The URL did appear in some emails to hotmail.com addresses; is it conceivable that MSN actually pulls out URLs from emails for spidering? Seems quite unlikely.
  • MSNBot visited the domain disguised, both by IP address and user agent, as someone else, to find the URL in question. I would hope MSNBot wouldn’t engage in such a poor practice, but maybe they do it to detect cloaking or similar manipulative practices.

I don’t mean to be a conspiracy theorist, but can anyone conceive of any other way the MSNBot could have even found out about the URL in question?

Ads in RSS

Google is beta-testing AdSense for RSS feeds. I hope this catches on and encourages more content providers to put the full text of their entries in their RSS feeds, rather than just initial snippets, a practice that makes a feed nearly worthless for offline reading. I’ve complained about snippety RSS before, as have many others. If the only obstacle to including full text in feeds is fear of lost revenue, this should fix that. (Presumably the fear is not bandwidth-related — 1 or 2 kilobytes versus a fraction of a kilobyte per entry shouldn’t be an issue for anyone anymore, if it ever was).

Craigslist into Outer Space

Maybe I missed this the first time around, but I just noticed that craigslist is providing an opportunity to have free postings sent into outer space.

From the FAQ:

Q: Is this a hoax?
A: No.

I also noticed that craigslist is supporting the Spread Firefox campaign by posting links to the campaign on just about every page.

Go craigslist!

Google Maps and Craigslist = Great

Hat tip to Steve: Some genius has integrated Craigslist and Google Maps. Better, even, than chocolate and peanut butter.

(especially if you’re allergic to peanut butter)

IM(usic)DB?

Why isn’t there an Internet Music Database analogous to the Internet Movie Database? That is, a collaborative/community-built resource filled with information about musicians and albums?

The closest thing we have is freedb, or its commercial equivalent CDDB, but these are very different from what I’m imagining. With freedb, you can look up artists, albums, tracks, and genres. That’s about it. No lists of the musicians/instruments playing on each track. No links to reviews, commentary, images of the musicians, and so on. No collaborative quality rating system. None of the other various resources that make IMDB the top result in Google for most movies (usually higher even than the studio’s own page for that movie).
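
To make the contrast with freedb concrete, here is a purely illustrative sketch of the kind of per-album record I have in mind (the field names are mine, not anyone’s actual schema):

    # Purely illustrative: the kind of record an "Internet Music Database"
    # might hold, beyond freedb's artist/album/track/genre fields.
    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class Credit:
        musician: str
        instrument: str

    @dataclass
    class Track:
        title: str
        credits: List[Credit] = field(default_factory=list)   # who played what

    @dataclass
    class Album:
        artist: str
        title: str
        year: int
        genre: str
        tracks: List[Track] = field(default_factory=list)
        review_urls: List[str] = field(default_factory=list)  # reviews and commentary
        ratings: List[int] = field(default_factory=list)      # community ratings

        def average_rating(self) -> Optional[float]:
            return sum(self.ratings) / len(self.ratings) if self.ratings else None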

So am I missing something? If not, anyone interested in starting the Internet Music Database?

Google News Lawsuit Update

A few interesting bits to follow up on Sunday’s entry about Agence France Presse vs. Google.

It turns out this suit was filed in the United States, not France as I had suspected. Eric Goldman has posted the complaint, which was filed in Federal District Court in the District of Columbia.

AFP alleges (in Paragraph 23) that Google sells advertising as part of its service, presumably setting up an argument against fair use. Google has scrupulously avoided placing advertising on Google News, however, presumably for the same reason. I wonder whether Google’s attempt to “fence off” its “noncommercial” content will be effective in asserting a fair use defense—I’m not aware of any law directly on point.

Another interesting allegation (Paragraph 76) is that Google removes the AFP watermark from AFP images, which, if true, could create liability under the DMCA. It seems unlikely to me that Google would actually process images digitally to remove watermarks, though. Perhaps the news sites that publish AFP images, and from which Google extracts content, remove the watermarks. AFP’s claim that Google removes “AFP” from the text of the byline seems more plausible, but perhaps less actionable as copyright management information.

The article I linked earlier on this subject mentioned:

AFP say Google have ignored all attempts to stop the indexing while Google say all publishers have the option of not being indexed and included in Google News.

My guess here is that the Google News crawler does respect robot instructions, and that it is not actually getting any content from AFP directly—instead, it is indexing other publications that include AFP syndicated content. Arguably, then, if Google is a direct infringer, the publications from which it grabs content are contributory infringers for not excluding that content from Google’s system.

This leads to an important question: what is the default assumption when you post something on the web? Under copyright law, it’s fairly clear that the default is “all rights reserved,” and you need to ask permission to copy, modify, redistribute, and so on. Under the modus operandi of the web, however, it seems the default assumption is that reproduction such as caching and indexing in search engines is okay. So do news publications that carry AFP content have a duty to exclude the Google News crawler from their sites in order to avoid liability for contributory infringement? Or does Google have a duty to ask permission from each and every publication, and then possibly from the “upstream” publisher (in this case AFP), before it can quote any material (setting aside any fair use defense for the time being)?

It seems to me that the most efficient result here would be to assume that certain uses (caching, indexing, summarizing) can be made of content on the web unless the publisher indicates otherwise, through a robots exclusion file or some other standard method. This would be consistent with certain areas of the law that require a fence to be set up around property before it will be protected—for example, information cannot be protected as a trade secret unless its owner actually takes steps to preserve its secrecy. This approach would be, however, inconsistent with the weight of authority in copyright law, which is that an author need not take any steps to protect a work to receive the protection of the law.
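
The “fence” already has a standard mechanical form, of course. Here is a sketch of how a well-behaved crawler would consult it before copying anything, using Python’s standard robotparser module with a placeholder site and bot name:

    # Sketch: a well-behaved crawler consults the publisher's robots.txt
    # "fence" before fetching.  The URL and bot name are placeholders.
    from urllib import robotparser

    ARTICLE_URL = "http://example.com/2005/03/story.html"

    rp = robotparser.RobotFileParser()
    rp.set_url("http://example.com/robots.txt")
    rp.read()   # fetch and parse the publisher's exclusion file

    if rp.can_fetch("ExampleNewsBot", ARTICLE_URL):
        print("permitted: fetch, cache, summarize, index")
    else:
        print("excluded: the publisher has opted out; leave the page alone")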

Update: Professor Eric Goldman’s analysis of the case, also discussed on the Trademark Blog and John Battelle’s Searchblog.