The New Zealand Web Harvest 2008 Harvests Too Much

October 14, 2008 – 8:36 pm

I was looking through the server logs for this site tonight and I noticed a user-agent that I hadn’t seen before. It turns out that the National Library of NZ is trying to take a snapshot of all New Zealand sites to archive as part of their mandate to preserve cultural history.

I whole-heartedly approve of this effort. So much of our lives is conducted on-line these days that it would be impossible for future generations to understand us without seeing how our web pages work. Also, I spent a lot of time making that table of Steven Seagal films and I would like to think that archeologists may one day find it useful.

Unfortunately, the NatLib spider is crawling over parts of my website that it shouldn’t. My WordMap game consists of about 9000 words linked together in an almost infinite number of ways, with the pages generated on the fly. Spidering it serves no purpose, which is why I specifically tell bots not to in my robots.txt file.
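For anyone unfamiliar with the mechanism, here is a minimal sketch of how a well-behaved crawler might consult robots.txt before fetching a page, using Python's standard `urllib.robotparser`. The robots.txt contents, the `/wordmap/` path, and the example.com URLs are assumptions made up for illustration, not the actual file on this site; the user-agent string is the one that appears in the comments below.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt; the /wordmap/ path is an assumption for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /wordmap/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the rules permit this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(ROBOTS_TXT)

# A polite crawler asks before every fetch and skips anything disallowed.
for url in ("http://example.com/about", "http://example.com/wordmap/cat"):
    verdict = "fetch" if allowed(parser, "NLNZHarvester2008", url) else "skip"
    print(verdict, url)
```

The check is entirely voluntary on the crawler's side: nothing on the server stops a bot that never calls it.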

The harvest’s own web site contains this curious section:

Will you honour the robots.txt protocol?

No.

We realise this may be a contentious decision, and we have given this issue a great deal of thought. However, our current policy is to ignore the robots.txt file and harvest as many files as possible from each website unless we receive a request to do otherwise.

We believe it is best that we ignore robots.txt because we have a responsibility and mandate to preserve the New Zealand internet so that future New Zealanders can experience it just as we do – or as close as is technically possible. However, robots.txt files currently block many URLs. If we were to obey robots.txt we would only get a partial snapshot of the internet.

I can understand where they are coming from; there are probably all sorts of semi-secret juicy things hidden behind robots.txt. However, I think they are making a mistake – nothing actually enforces robots.txt rules, but sites usually block bots for good reasons.

In the case of WordMap, the bot will spider millions of useless pages, wasting bandwidth and resources both on the server and at the Library’s end. (Actually, I assume the bot is clever enough to give up eventually, but who knows how long that will take?)

For my own little site this is not the end of the world but I predict some sites will have terrible problems. Back in the day, bots used to regularly play havoc with badly written sites by spidering links that had side effects when accessed. More than one database was cleaned out because somebody foolish wrote a page that listed each record in a table with a link beside each one that said “delete”.
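The usual defence against this kind of accident is to make destructive actions respond only to POST, since crawlers follow links with GET. Below is a rough sketch of that idea, not code from any of the sites in question; the `handle_request` function, the `/delete/<id>` path scheme, and the in-memory `records` table are all hypothetical.

```python
# Hypothetical request handler: deletes happen only on POST, so a crawler
# that follows "delete" links with GET cannot change anything.

def handle_request(method: str, path: str, records: dict) -> int:
    """Return an HTTP-style status code for a request to /delete/<id>."""
    prefix = "/delete/"
    if not path.startswith(prefix):
        return 404  # Not Found: unknown URL
    if method != "POST":
        return 405  # Method Not Allowed: GET must never have side effects
    try:
        record_id = int(path[len(prefix):])
    except ValueError:
        return 400  # Bad Request: id is not a number
    if record_id not in records:
        return 404  # Not Found: no such record
    del records[record_id]
    return 200


# Made-up table contents for the demonstration.
records = {1: "first row", 2: "second row", 3: "third row"}

# Simulate a bot spidering every "delete" link with GET.
crawler_statuses = [handle_request("GET", f"/delete/{rid}", records)
                    for rid in list(records)]

# A real user submitting a form sends POST instead.
user_status = handle_request("POST", "/delete/2", records)
```

After the simulated crawl every record is still present; only the explicit POST removes one. A site written the old way, deleting on GET, has nothing but robots.txt standing between it and a spider.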

Issues like this are now better understood and modern web pages are (hopefully) not written like this. However, there will be plenty of older pages out there that exhibit similar behavior, protected by a robots.txt file that is going to be ignored. It will be interesting to see if any problems make the news.


4 Responses to “The New Zealand Web Harvest 2008 Harvests Too Much”

  1. easy fix:

    RewriteCond %{HTTP_USER_AGENT} ^NLNZHarvester2008
    RewriteRule ^(.*)$ http://tinyurl.com/2w4apm

    By Dave on Oct 15, 2008

  2. Yeah, I could do that (Rick Astley must be archived for future generations!). However, it is robots.txt that is supposed to keep bots out, not some mod_rewrite tricks.

    I feel I have honored my end of the bargain in warning bots away from parts of my site that it would be useless to spider; NatLib is not upholding their responsibilities when running their bot.

    By Andrew on Oct 15, 2008

  3. Hi Andrew

    Thanks for the feedback (I know people usually say that snarkily, but I’m really sincere here).

    We’ve just blogged about the Web Harvest & the pain it’s causing (and the comments we’ve seen). Our intentions really are good – to collect & preserve & make accessible NZ’s digital heritage for people in the future, the same way we do already for books & newspapers & photographs – and we’re trying to respond and fix people’s problems as quickly as we can.

    You can send comments & requests for us to change the crawler’s behaviour to web-harvest-2008@natlib.govt.nz.

    Thanks heaps,

    Courtney

    By Courtney Johnston on Oct 15, 2008

  4. Courtney,

    I can see your reasoning; I just think you have made a “brave” call that will likely cause trouble for some people. But it’s great to see that you guys are being so responsive about any problems, and hopefully nothing too bad will happen.

    Aside from that, The National Web Harvest sounds like a fantastic project. Rock on NatLib!

    By Andrew on Oct 16, 2008
