Bad MSN

MSNBot recently attempted to request a page on a website I run that is not publicly linked. In fact, no page from the domain name in question is publicly linked. The robots.txt file excludes all robots. My .htaccess file also blocks HTTP requests from known search engine robot IP address ranges (including MSNBot’s). Moreover, the page requested wasn’t the top-level page (e.g., http://domain.com/), but a page buried within the site (e.g., http://domain.com/some_dir/some_page.html).
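For reference, here is a minimal sketch of what "excludes all robots" means in practice. The robots.txt contents below are an assumption (the standard two-line form for excluding every crawler), not a copy of my actual file, and `is_allowed` is a hypothetical helper name. It uses Python's standard `urllib.robotparser` to ask whether a well-behaved crawler would consider the buried page fetchable.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt contents: the conventional way to exclude all robots.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines; no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Placeholder URL from the post, not a real address.
    print(is_allowed("msnbot", "http://domain.com/some_dir/some_page.html"))
```

Under these rules, a crawler that honors robots.txt should not request any page on the domain, buried or otherwise.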

The only request I have in my server logs from MSNBot is for this buried page. MSNBot (or at least anything identifying itself as MSNBot) never requested any of the pages that would be necessary to find it. The universe of people who have access to the website at this domain is very small; I can in fact identify every single IP address in the server log as belonging to someone I know.
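The log check described above can be sketched roughly as follows. This assumes the server writes Apache's "combined" log format; the sample lines, IP addresses (taken from documentation-only ranges), timestamps, and status codes are all made up for illustration and are not my real log entries. `requests_by_agent` is a hypothetical helper name.

```python
import re

# Apache "combined" format:
# ip ident user [time] "METHOD path protocol" status size "referer" "user-agent"
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def requests_by_agent(lines, agent_substring):
    """Return (ip, path, status) for each request whose user agent
    contains agent_substring (case-insensitive)."""
    needle = agent_substring.lower()
    hits = []
    for line in lines:
        match = LOG_LINE.match(line)
        if match and needle in match.group("agent").lower():
            hits.append((match.group("ip"), match.group("path"), match.group("status")))
    return hits


# Illustrative log lines only; every value here is invented.
SAMPLE_LOG = [
    '192.0.2.10 - - [01/Jan/2005:10:00:00 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
    '192.0.2.10 - - [01/Jan/2005:10:00:05 +0000] "GET /some_dir/ HTTP/1.1" 200 734 "http://domain.com/" "Mozilla/5.0"',
    '198.51.100.7 - - [02/Jan/2005:03:12:44 +0000] "GET /some_dir/some_page.html HTTP/1.0" 403 210 "-" "msnbot/1.0 (+http://search.msn.com/msnbot.htm)"',
]

if __name__ == "__main__":
    for ip, path, status in requests_by_agent(SAMPLE_LOG, "msnbot"):
        print(ip, path, status)
```

A result containing only the buried page, with no earlier requests for the pages that link to it, is the pattern that prompted this post. Note that this only catches requests that identify themselves as MSNBot in the user agent string.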

I can conceive of only two explanations for this, both of which would be a bad sign for MSNBot:

  • The URL did appear in some emails to hotmail.com addresses; is it conceivable that MSN actually pulls out URLs from emails for spidering? Seems quite unlikely.
  • MSNBot visited the domain disguised, both by IP address and user agent, as someone else, to find the URL in question. I would hope MSNBot wouldn’t engage in such a poor practice, but maybe it does so to detect cloaking or similar manipulative practices.

I don’t mean to be a conspiracy theorist, but can anyone conceive of any other way the MSNBot could have even found out about the URL in question?