Improving Nutch for constrained crawls
Posted by Kelvin on 18 Jul 2005 at 11:16 pm | Tagged as: work, programming, Lucene / Solr / Elasticsearch / Nutch
A heads-up to anyone who is using Nutch for anything but whole-web crawling...
I'm about 70% through patching Nutch to be more usable for constrained crawling. It is basically a partial rewrite of the whole fetching mechanism, borrowing large chunks of code here and there.
Features include:
– Customizable seed inputs, e.g. seed a crawl from a file, a database, a Nutch FetchList, etc.
– Customizable crawl scopes, e.g. crawl the seed URLs and only the URLs within their domains (this can already be accomplished manually with RegexURLFilter, but what if there are 200,000 seed URLs?), or crawl the seed URL domains plus one level of external links (not possible with the current filter mechanism); see the sketch after this list
– Online fetchlist building (as opposed to Nutch's offline method), with customizable strategies for building a fetchlist. The default implementation gives priority to hosts with a larger number of pages to crawl. Offline fetchlist building is still supported.
– Customizable fetch output mechanisms, e.g. output to a file, to a segment, or to nothing at all (if we're just implementing a link checker, for example)
– Fully utilizes HTTP 1.1 features (see below)
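To make the crawl-scope idea concrete, here is a rough sketch of what a pluggable scope might look like. The names (CrawlScope, SeedHostScope) are hypothetical illustrations, not the actual classes in my patch:

import java.net.URL;
import java.util.Set;

/** Hypothetical pluggable crawl scope: decides whether a discovered
 *  URL falls within the crawl. Sketch only. */
public interface CrawlScope {
    boolean inScope(URL url);
}

/** Example scope: accept only URLs whose host matches one of the
 *  seed hosts, sidestepping the need for 200,000 regex rules. */
class SeedHostScope implements CrawlScope {
    private final Set<String> seedHosts; // lowercased host names

    SeedHostScope(Set<String> seedHosts) {
        this.seedHosts = seedHosts;
    }

    public boolean inScope(URL url) {
        return seedHosts.contains(url.getHost().toLowerCase());
    }
}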
My aim is to be fully compatible with Nutch as it is, i.e. given a Nutch fetchlist, the new crawler can produce a Nutch segment. However, if you don't need that and are just interested in Nutch as a crawler, that's ok too!
Nutch and HTTP 1.1
Nutch, as it stands, does not utilize HTTP 1.1's connection persistence and request pipelining features, which would significantly cut down crawling time when crawling a small number of hosts extensively.
To test this hypothesis, I modified Http and HttpResponse to use HTTP 1.1 features and fetched 20 URLs from the same host repeatedly (15 requests were pipelined per socket connection, so a total of 2 separate connections to the server were needed). I repeated this test with Nutch's stock Http class (protocol-http). After a couple of runs, I was convinced that, at least in my test case, the optimized Http class was 3-4x faster than Nutch's version.
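For the curious, the essence of persistence plus pipelining is easy to demonstrate with raw sockets. This is a minimal sketch, not the patched Http class; the host and paths are placeholders:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.Socket;
import java.net.SocketTimeoutException;

/** Minimal demonstration of HTTP/1.1 pipelining over one persistent
 *  connection: all requests are written before any response is read.
 *  A real client must parse Content-Length / chunked encoding to
 *  split the response stream into individual responses. */
public class PipeliningSketch {
    public static void main(String[] args) throws Exception {
        String host = "example.com"; // placeholder HTTP/1.1 host
        String[] paths = { "/", "/page1.html", "/page2.html" };

        try (Socket socket = new Socket(host, 80)) {
            socket.setSoTimeout(5000); // don't block forever on keep-alive

            // Pipeline: send every request up front on the same socket.
            StringBuilder batch = new StringBuilder();
            for (String path : paths) {
                batch.append("GET ").append(path).append(" HTTP/1.1\r\n")
                     .append("Host: ").append(host).append("\r\n")
                     .append("Connection: keep-alive\r\n\r\n");
            }
            OutputStream out = socket.getOutputStream();
            out.write(batch.toString().getBytes("US-ASCII"));
            out.flush();

            // Responses come back in request order on the same stream.
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(socket.getInputStream(), "US-ASCII"));
            try {
                for (String line; (line = in.readLine()) != null; ) {
                    System.out.println(line);
                }
            } catch (SocketTimeoutException done) {
                // Server kept the connection open; we've read everything.
            }
        }
    }
}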
I believe the reasons for this speed-up are:
1. Reduced connection overhead (2 socket connections vs 20)
2. Reduced wait-time between requests (once vs 19 times); this is the fetcher.server.delay value in NutchConf (a back-of-envelope calculation follows this list)
3. Request pipelining (not waiting for a request to produce a response before sending the next request)
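To put rough numbers on point 2: assuming, say, a 1-second fetcher.server.delay, fetching 20 pages over 20 separate connections spends 19 seconds in enforced politeness delays alone, versus just 1 second when the same pages arrive over 2 pipelined connections.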
Note that this is a contrived example, of course, but not an improbable one when running crawls of large domains with a large number of fetcher threads.
I've still a pretty long way to go in terms of testing, but I think a good portion of the difficult bits have already been ironed out. When I'm satisfied that it's stable, I'll release a version for download and feedback.
Update (16 Aug 2005)
You might be interested in "Reflections on modifying the Nutch crawler".
One Response to “Improving Nutch for constrained crawls”
Hi,
Really nice hearing about your work. Have you made any progress on it? I would be very interested in seeing what performance gains you achieved.
Cheers,
Gonçalo