Popularity Out of Control

This blog has been semi-offline for at least a few hours. Sorry about that.

Through an interesting (at least to me) series of events, I discovered that my custom “popularity” hack/plugin was out of control.

It started earlier today when one of the RAID discs on this server gave a SMART error. I’ve never been able to really judge whether SMART errors are “for real” or not: sometimes I’ve had a disc with a serious SMART error that, after another test, will run fine for another year or two without incident.

In any case, I wasn’t so worried, since this system is running RAID-5, so a single disc failing, even catastrophically, doesn’t result in any data loss or downtime.

I decided to take the disc out of the RAID and run some SMART tests on it. I marked it with mdadm as FAILed (mdadm --fail) and then removed it. Strangely, the disc first passed a short test with no problems, but then ceased to be recognized as having SMART capabilities at all.

I’ve had that problem before and never figured it out—the SMART capabilities seem to come back eventually. Not knowing what best to do, I just put the drive back into the RAID to see what would happen.

Since the disc had already been marked as FAILed, the reconstruction started from scratch. Although I’ve got a pretty good (2 GHz) CPU and plenty (2 GB) of RAM, kblockd and md0_raid5 are using up most CPU cycles and everything has slowed down significantly. Reconstruction seems to be proceeding fine, just slowly: it will take 24-36 hours at the current rate. Presumably it would be much faster in single user mode, but I can’t take this server offline.

In the meantime, I noticed that my blog had stopped responding entirely, while other blosxom blogs served from this same server were fine (maybe a little more latency than usual). So I started going through the differences between my blosxom installation and everyone else’s to see what could possibly be slowing things down so much as to time out the blog.

Eventually I identified the culprit: my custom-made “popularity” plugin, which reports the most popular entries on this blog and the number of hits they have received. I hacked it together several years ago. I think at the time I just wanted to see if it would work, with the plan to come back later and fix it. I guess I forgot the “fix it” part.

My popularity plugin creates a log file that it reads in each time the blog is accessed, and then appends to that file. Over the years, that file has grown to 17 megabytes. Although this is a huge waste of system resources, I didn’t notice it when the system was running at full speed. With the reduced performance from the RAID reconstruction, however, this meant that my blog never finished loading at all.
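
For what it’s worth, a less wasteful design would keep one running count per entry instead of replaying an ever-growing log on every request. What follows is only a sketch in Python, not the actual plugin (blosxom plugins are Perl); the function names, the JSON counts file, and its layout are all assumptions made up for illustration.

```python
import json
import os
import tempfile
from collections import Counter


def record_hit(counts_path, entry):
    """Add one hit for `entry` and return the updated counts.

    The file holds one number per distinct entry, so its size tracks the
    number of entries on the blog rather than the number of hits.
    Hypothetical design: no locking, so concurrent requests could lose a hit.
    """
    counts = Counter()
    if os.path.exists(counts_path):
        with open(counts_path, encoding="utf-8") as fh:
            counts.update(json.load(fh))
    counts[entry] += 1

    # Write to a temporary file and rename it into place, so a crash
    # mid-write cannot leave a truncated counts file behind.
    directory = os.path.dirname(os.path.abspath(counts_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump(counts, fh)
    os.replace(tmp_path, counts_path)
    return counts


def most_popular(counts, n=5):
    """Return the n most-hit entries as (entry, hits) pairs."""
    return counts.most_common(n)
```

With something like this, each page view reads and rewrites a file of a few kilobytes at most, instead of a 17 megabyte log.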

If you made it this far, you certainly deserve some sort of geek admin award. Congratulations. I deserve some sort of stupid admin award, myself.

P.S. Apparently I must have forgotten to check my horoscope today. The laundry machine flooded the basement when I didn’t check the sink into which the laundry machine empties. I should probably avoid sharp objects for the rest of the day.

Feed with Comments

Due to popular request, I’m now providing a separate RSS feed that includes comments. If you subscribe to my blog and would prefer to see people’s comments, just point your aggregator at http://adam.rosi-kessel.org/weblog/comments.rss.

I am a Statistic

Take the MIT Weblog Survey

(yeah, just following the flock, but I think it’s a good cause.)

Clever Referer Spam

Update (2/26/06): Someone associated with the ‘nipple huggers’ site has written to complain about my accusations here. She also has left a couple of comments below. Just to be clear, there is no evidence that the site sends email spam, uses obtrusive popups, or installs spyware/adware, etc., on your computer. It appears simply that someone has attempted to optimize their position in search results by generating HTTP requests to other popular sites with their domain name in the referer field.


I used to have a big problem with “referer spam.” What is referer spam? My weblog lists “inbound links” on the right column so visitors can see who else has linked here. Since many weblogs provide a similar list, spammers began to create “spurious” inbound links so their URLs would appear in the right column of many weblogs, thus boosting their Google PageRank. Usually, if you went back to the site that ostensibly linked to my weblog, it would be a porn or gambling site with no true links to my weblog.

This was easy enough to fix: I wrote a handmade filter that regularly checks all the putative inbound links and verifies that they do, in fact, link to my site.
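
The core of that check can be sketched roughly as follows. This is an illustration in Python, not my actual script: the names links_to_site and _LinkCollector are invented here, and fetching the referring page (and handling timeouts, redirects, and so on) is left to the caller.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class _LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def links_to_site(page_html, page_url, my_host):
    """Return True if the page contains at least one link to my_host.

    page_url is used to resolve relative links; my_host should be a
    lowercase host name.
    """
    collector = _LinkCollector()
    collector.feed(page_html)
    for href in collector.hrefs:
        host = urlparse(urljoin(page_url, href)).hostname
        if host == my_host:
            return True
    return False
```

A referer is kept only if links_to_site returns True for the page it claims to come from. Merely mentioning the host name in the page text does not count, since only real anchor tags are inspected.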

Just today, I found my first instance of a spammer adaptation: the inbound link came from a site selling “nipple huggers” — some sort of jewelry that I don’t quite understand. I was curious how the site escaped my “referer check” script, so I checked it out. It turns out the “nipple hugger” site does link to my blog, with the link text “PopUp Scam – Click X to Close.” The linked page on my site has nothing to do with popup scams, but it is an interesting workaround to my filter. Rather than generating fake/spurious links, apparently real visitors to the “nipple hugger” site click on the link to my blog, and generate “real” referer links. Just today, I received inbound links from ten different hosts from the “nipple hugger” page.

I can’t think of any clever way to automatically filter these sorts of inbound links, because they really don’t look any different from genuine inbound links. At this point, I’m just inserting a keyword filter for known bad referers (just the “nipple hugger” at this point). Suggestions for more clever ways to escalate this arms race are welcome.

(I really hope my site doesn’t become a top search result for “nipple hugger” now. If it does, please, look elsewhere, I don’t even know what they are!)

Bloglines and the Perils of Syndication

Martin Schwimmer (The Trademark Blog) posts an interesting discussion about why he doesn’t allow his RSS feed to be carried by Bloglines. Bloglines bills itself as “the most comprehensive, integrated service for searching, subscribing, publishing and sharing news feeds, blogs, and rich Web content.” Or, in other words, it aggregates different weblogs and other sources that publish in RSS format so that a reader can get all their selected information from one website.

Although many people use “offline” RSS aggregators like Straw for GNU/Linux and SharpReader for Windows (I don’t know what OS X people use), for people who don’t access the web through a single computer all the time, a “free” website that performs this aggregating service sounds like a good idea.

The problem, Schwimmer points out, is that Bloglines has a business plan. And that business plan has been described by at least one analyst as AdWords on Steroids. Bloglines plans to use weblog content written by other people for data mining and targeted advertising, without the writer’s permission.

This doesn’t sit well with me. First, as an online privacy advocate (despite my recent outing of two anonymous U-Haul commenters), I’d rather not provide grist for data collection and profiling, especially where the readers are quite unlikely to realize what is happening. Second, I have no control over the content of the ads that might surround my blog. Google AdWords has provoked a lot of controversy (not to mention several lawsuits) by selling trigger words to advertisers that include competitors’ trademarks. I think Google is probably right, both legally and in terms of commercial ethics, in that scenario: consumers searching for ‘Nike shoes’ might in fact benefit from a link to New Balance with the description ‘New Balance shoes are cheaper and better quality!’, and aren’t likely to be confused about the source or origin of what they’re getting.

I am less comfortable with the idea that there might be ads surrounding my weblog entries for porn, online gambling, or worse — legal services. Unlike the Google AdWords example, in that case Bloglines (or another commercial aggregator/data miner) would be using the fruits of my own labor in a way that might associate me with entities I do not want to endorse or that might be in direct commercial competition with me. It’s fairly intuitive to think that not only does an advertiser endorse particular content, but that the creator of that content at least nominally endorses the advertiser. This is why political magazines like Ms. Magazine did not accept advertising for many years (although they do, within certain limits, now).

Finally, from an economic perspective, it seems to me that Bloglines would be profiting without really doing anything productive or creative: the only value-added is the advertising itself, and perhaps the aggregation feature, but that is available for free without advertising from other sources.

It’s useful to compare the function of commercial Linux distributors like SuSE and Red Hat with Bloglines. The commercial Linux distributors take free content, package it, certify it in some way, support it, help fix bugs, provide a “bricks and mortar” infrastructure for getting the product out there, all requiring a substantial input of resources. To take blog content and put ads around it, on the other hand, requires almost no creative (or other) resources. I suppose they are providing some bandwidth that might be useful if the blog publisher is short on that, but a better solution in that case would be for the blog publisher to run ads themselves and use the money to pay for more bandwidth.

Presently it appears that I have seven or eight Bloglines subscribers. I won’t be cutting them off any time soon, but I am considering licensing my blog under a noncommercial Creative Commons license that should prohibit the kind of data mining and advertising that Bloglines is planning with content I create. Although I think that kind of license is inappropriate for most software (and certainly doesn’t comply with the Debian Free Software Guidelines), I think it might be the only way to avoid some of the consequences discussed above.

Broken

My weblog seems to have entirely disappeared from the front page. I’m just creating this entry to see if it returns. Quite busy starting work these days; I expect to resume semi-regular blogging in a week or two.

Update: it’s fixed. I had hundreds of empty ‘names.txt’ files in my blog directories, the result of an errant script. Blosxom saw empty blog entries in every category and attempted to display them.
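
For anyone chasing a similar problem, something along these lines would list the offending files. It is a sketch in Python rather than what I actually ran; the function name and the idea of passing in the blog’s data directory are assumptions. It only reports the files and leaves deleting them to you.

```python
import os


def find_empty_files(root, filename="names.txt"):
    """Return sorted paths of zero-byte files called `filename` under root."""
    empty = []
    for dirpath, _dirnames, filenames in os.walk(root):
        if filename in filenames:
            path = os.path.join(dirpath, filename)
            if os.path.getsize(path) == 0:
                empty.append(path)
    return sorted(empty)
```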

Hackergotchi

Here is my Hackergotchi. I’ve never been too adept at The Gimp, but I think it’s not too bad, right? The real problem is that it’s hard to have a beard and a proper drop-shadow. Maybe if I lightened up my beard a bit so the shadow is more obvious… (probably won’t look quite right in Internet Explorer, which still can’t render transparent PNGs properly).

Spam Be Gone

I think I’ve found a solution to my persistent “spam referrer” woes, where porn sites (particularly “Paris Hilton” related—to whom, I continue to assert, I have no connection whatsoever) create spurious links from weblogs and boost their Google PageRank. About twice an hour, I have a script that looks up all the “recent inbound links” sites and checks to see if they actually link to my site. If they don’t, they’re removed.

I’m sure a few legitimate inbound links will be removed in the process, but it’s much preferable to having to manually cull out all the porn sites. As it turns out, porn sites never actually link to me!

I wonder how long it will take for the spam referrers to figure out a way around this filter.

Internet Pestilence

I’ve written about my tribulations with “Paris Hilton” related referrer spam before. Since my weblog tracks “inbound links” on the right side, spammers create spurious inbound links into my site so that their site will be linked from mine and thus have greater visibility and a higher Google PageRank. My solution has been to ban anything with the words “paris,” “hilton,” and a host of other porn-related terms from the list.
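
In outline, the ban list works something like the sketch below. This is illustrative Python, not my real code: the function names are made up, and only the two terms named above are included, since the rest of the list isn’t reproduced here.

```python
# Terms taken from the post; the full ban list has more entries.
BANNED_TERMS = ("paris", "hilton")


def is_banned(referer, banned_terms=BANNED_TERMS):
    """Return True if any banned term occurs anywhere in the referer URL.

    Plain substring matching is crude: it also rejects innocent URLs
    that happen to contain a banned term, such as ".../comparison".
    """
    lowered = referer.lower()
    return any(term in lowered for term in banned_terms)


def filter_referers(referers, banned_terms=BANNED_TERMS):
    """Keep only the referers that contain no banned term."""
    return [r for r in referers if not is_banned(r, banned_terms)]
```

The obvious weakness is that the list only ever knows about yesterday’s spam, so every new campaign needs new terms.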

As of today, I’m getting a new breed of referrer spam: Janet Jackson Super Bowl video referrers. Maybe it’s a bad idea to track inbound links at all. Or maybe the solution is to have my referrer tracker actually look at the supposed inbound link and make sure that it does, in fact, link to my site. In any case, I’ve now added a bunch of Janet Jackson related terms to my banned list.

How will this arms race end?

While I’m talking about scourges of the Internet, what’s the deal with autoreply virus/worm detectors? A huge number of corporate and educational mailservers scan incoming email for worms and viruses, and if they detect a worm or virus send a message to the sender telling them the message was subscribed and that they are infected. Usually, the autoreply also includes a plug for the email scanner software itself.

So how is it that the developers of this software are smart enough to include the distinctive signatures of all these email worms, but not smart enough to realize that those same worms always forge the “from:” part of the header? That means if the apparent sender actually is infected, it’s at best a total coincidence. There is no connection, with most worms, between the “sender” of an email and the person who is actually infected with the worm (other than that some third person who is infected might have the apparent sender in their address book). Presumably these software developers are smart people and spend some time trying to understand email worms and viruses, and send out frequent updates of the distinctive signatures of worms and viruses.

Does anyone have a rational explanation? Even better, can someone educate these software developers and the people who purchase their software to end this scourge of false “virus detected” emails?

Zippy the Pinhead on Paris Hilton

(maybe someday people will come to my weblog based on something other than non-existent Paris Hilton materials!)