Bad MSN

The MSNBot recently attempted to request a page on a website I run that is not publicly linked. In fact, no page on the domain in question is publicly linked. The robots.txt file excludes all robots. My .htaccess file also blocks HTTP requests from known search engine robot IP address ranges (including MSNBot’s). Moreover, the page requested wasn’t the top-level page (e.g., http://domain.com/), but a page buried beneath it (e.g., http://domain.com/some_dir/some_page.html).
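As a rough illustration of what "excludes all robots" means to a well-behaved crawler, here is a minimal Python sketch using the standard library's `urllib.robotparser`. The robots.txt contents and the helper name `is_allowed` are assumptions for the example; the post does not quote the actual file, and nothing here says how MSNBot itself evaluates the rules.

```python
from urllib.robotparser import RobotFileParser

# Assumed contents of a robots.txt that excludes all robots
# (the real file is not shown in the post).
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # A compliant crawler should skip both the top-level and the buried page.
    print(is_allowed("msnbot", "http://domain.com/"))
    print(is_allowed("msnbot", "http://domain.com/some_dir/some_page.html"))
```

A crawler that honours the file would get `False` for every URL on the domain, so a request for the buried page means either the file was never read successfully or it was ignored.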

The only request I have in my server logs from MSNBot is for this buried page. MSNBot (at least, anything identifying itself as MSNBot) never requested any of the pages that would be necessary to find this buried page. The universe of people who have accessed the website at this domain is very small; I can in fact identify every single IP address in the server log as someone I know.
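For anyone who wants to run the same check on their own logs, the following is a minimal sketch, assuming Apache's "combined" log format. The sample lines, IP addresses, and the function name `requests_by_agent` are made up for illustration and are not taken from my actual log.

```python
import re

# Apache "combined" log format (an assumption; adjust to your LogFormat).
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+ '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def requests_by_agent(lines, needle="msnbot"):
    """Return (ip, path, status) for each request whose user agent contains needle."""
    hits = []
    for line in lines:
        match = LOG_LINE.match(line)
        if match and needle in match.group("agent").lower():
            hits.append(
                (match.group("ip"), match.group("path"), int(match.group("status")))
            )
    return hits


# Hypothetical log lines using documentation-only IP ranges.
SAMPLE = [
    '192.0.2.10 - - [01/Jan/2000:10:00:00 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
    '192.0.2.10 - - [01/Jan/2000:10:00:05 +0000] "GET /some_dir/some_page.html HTTP/1.1" 200 2048 "http://domain.com/" "Mozilla/5.0"',
    '203.0.113.7 - - [01/Jan/2000:11:30:00 +0000] "GET /some_dir/some_page.html HTTP/1.0" 403 199 "-" "msnbot/1.0"',
]

if __name__ == "__main__":
    for ip, path, status in requests_by_agent(SAMPLE):
        print(ip, path, status)
```

If the only hit is the buried page, with no earlier request for the pages that link to it, the bot must have learned the URL somewhere other than by crawling the site under its own name.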

I can conceive of only two hypotheses for this, both of which would be a bad sign for MSNBot:

  • The URL did appear in some emails to hotmail.com addresses; is it conceivable that MSN actually pulls out URLs from emails for spidering? Seems quite unlikely.
  • MSNBot visited the domain disguised, both by IP address and user agent, as someone else, to find the URL in question. I would hope MSNBot wouldn’t engage in such a poor practice, but maybe they do it to detect cloaking or similar manipulative practices.

I don’t mean to be a conspiracy theorist, but can anyone conceive of any other way the MSNBot could have even found out about the URL in question?

14 comments

  1. Ploum Jan 28

    It seems that MSNbot visits every address that appears in an MSN chat.
    You are not the first person I have heard of with this experience.

  2. Jeremy Nickurak Jan 28

    Outlook/Explorer/Messenger watching URLs discussed/browsed as a source for MSN indexing targets?

  3. acid Jan 28

    Maybe someone typed this URL into their search/URL toolbar and their f*cking browser tried to look it up with the MSN search engine?

    Since the search engine did not find it, it tried (maybe later) to check it…

  4. Adam Rosi-Kessel Jan 28

    Interesting suggestions. It should be possible to test the MSN chat hypothesis: create a “trap” for the MSNBot with a URL that is only used in MSN chat. Has anyone done that? I would think there might be a minor privacy uproar if MSN is monitoring chats for ideas of what to spider. Hotmail would be even worse.

  5. Thomas Jan 28

    Your second point isn’t possible if you know every other entry in the log.
    Are you 100% sure that nobody who knows the URL ever linked to that page, perhaps through a bookmarking service or something similar?
    The automatic MSN search explanation seems reasonable too.

  6. Foo Jan 28

    Why would it be so unlikely that Microsoft snoops on Hotmail mails? AFAIK they send MSNBot to URLs that appear in MSN Messenger messages.

  7. Pharao Jan 28

    You can keep the bot away from your page with some rewrite rules in the Apache config.
    There was a link in http://www.linux-web.de/artikel/5031/hilight,msnbot/kein-MSNBot-mehr-.html
    but 1) it is a German forum and 2) the link to the config seems to be offline… maybe “michael” would upload it again if you ask him…

  8. Martijn Vermaat Jan 28

    Visiting links appearing in MSN Messenger messages would be very bad practice in my opinion.

    As far as honouring the robots.txt directives is concerned, my experience with MSNbot is that it actually does this right, so I’m surprised you say it ignores the robots.txt on your server.

  9. Adam Rosi-Kessel Jan 28

    Actually, on closer inspection, I discovered that my .htaccess file was blocking robots from requesting even robots.txt, so the MSNBot likely got an “access denied” error when it tried to fetch that file. I suppose the behavior is not quite as bad, then. I’ve fixed it now.

  10. Tuukka Hastrup Jan 28

    Perhaps you already thought about this, but if you have outgoing links from that secret page, the visitors’ browsers have leaked the URI to the target sites as the http referrer address.

  11. niol Jan 28

    Your story reminded me of this other story

  12. Adam Rosi-Kessel Jan 28

    Just to clarify one thing: I realize there is no way for this site to be truly “secret” without authentication; it’s not particularly sensitive stuff, I just wanted to keep it out of search engines.

  13. jab Jan 28

    The culprit for this sort of thing is often referrer logs. Somebody visits your page, then visits website B. The browser reports your URL to website B’s server in the HTTP referrer field. Website B for some reason exposes its log files to the internet, and they get crawled by the MSNBot spider. Happens surprisingly often – ‘secret’ URLs end up in search engines all the time through this method.

  14. UG Jan 28

    Maybe it’s random.
