Using Nutch for constrained crawling
Posted by Kelvin on 08 Jul 2005 at 04:04 pm | Tagged as: programming, Lucene / Solr / Elasticsearch / Nutch
My experience with Nutch so far is that it's not ideal for crawling, say, 50,000 specific sites, as opposed to whole-web crawling.
Here's why I think so:
- No built-in interface for seeding from sources other than a text file or a downloaded DMOZ dump. Extra steps have to be taken to, say, dump a database table to text, rather than linking Nutch directly to the database (a small script like the first sketch after this list). Of course, this is a very minor issue.
- Not possible (without major hacking) to implement crawl scopes, such as "crawl this entire domain" or "crawl this entire domain plus one link outside of it" (see the scope-filter sketch below).
- Needs patching to change the definition of a "host". This matters because sites with different subdomains are often served from the same server. For example, x.foo.com and y.foo.com are considered different hosts in Nutch, but they may very well be served from the same machine; by making Nutch group URLs on the domain (foo.com) rather than the full hostname, crawling would be more polite to that server (see the host-normalization sketch below).
- No easy way to change the fetchlist-building strategy, which is currently optimized for whole-web crawling. It assumes a large distribution of distinct hosts, so a randomized spread of URLs is enough to be polite to servers (i.e. not hammer them). That assumption does not hold for site-specific crawls, especially ones where pages outside the seed domains are never fetched (a per-host fetchlist like the sketch below would fit better). As far as I can tell, fetching also does not take advantage of HTTP 1.1's ability to download multiple pages per connection.
- Needs patching to support custom fetch output (or none at all). For example, an application that simply wants to check whether pages return 404 would plug in a no-op fetch output (see the fetch-handler sketch below).
- Needs patching to support post-fetch (non-output) operations. For example, if a fetched page contains a certain keyword, add an entry to a database table (also covered in the fetch-handler sketch below).
- No hook for URL-pattern or domain-specific operations, such as: if the URL contains "?", don't parse the page for outlinks (last sketch below).
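On the seeding point: a one-off dump script is all it takes today, something like the sketch below. The JDBC URL, credentials, table name and column name are placeholders I made up, not anything Nutch provides.

```java
import java.io.PrintWriter;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

/** Dumps seed URLs from a database table into the flat text file Nutch's injector expects. */
public class SeedDumper {
    public static void main(String[] args) throws Exception {
        // Connection string, credentials, and the seed_urls table are hypothetical;
        // the appropriate JDBC driver must be on the classpath.
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:mysql://localhost/crawl", "user", "pass");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT url FROM seed_urls");
             PrintWriter out = new PrintWriter("seeds.txt")) {
            while (rs.next()) {
                out.println(rs.getString("url"));   // one URL per line
            }
        }
    }
}
```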
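On crawl scopes: the kind of filter I have in mind would look roughly like this. The class and method names are illustrative, not Nutch's actual plugin interface, though the contract (return the URL to keep it, null to drop it) is the same shape as a URL filter. Note that "domain plus one link outside" can't be expressed by a per-URL filter alone; it needs the link depth from the seed domain tracked in the crawl database, which is where the major hacking comes in.

```java
import java.net.URL;
import java.util.HashSet;
import java.util.Set;

/**
 * Sketch of a crawl-scope filter: accept a URL only if its host falls inside
 * one of the seed domains. Standalone illustration, not a Nutch plugin.
 */
public class DomainScopeFilter {
    private final Set<String> seedDomains = new HashSet<>();

    public DomainScopeFilter(Set<String> domains) {
        for (String d : domains) {
            seedDomains.add(d.toLowerCase());
        }
    }

    /** Returns the URL if it is in scope, or null to drop it from the fetchlist. */
    public String filter(String urlString) {
        try {
            String host = new URL(urlString).getHost().toLowerCase();
            for (String domain : seedDomains) {
                // in scope if the host is the domain itself or any subdomain of it
                if (host.equals(domain) || host.endsWith("." + domain)) {
                    return urlString;
                }
            }
        } catch (Exception e) {
            // malformed URL: treat as out of scope
        }
        return null;
    }
}
```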
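On the host definition: grouping by domain could be as simple as the sketch below. This naive "last two labels" version is only a sketch; a real one would need a public-suffix list so that, say, bar.co.uk isn't collapsed to co.uk.

```java
import java.net.URL;

/** Sketch: group URLs by domain instead of full hostname for politeness. */
public class HostNormalizer {
    /**
     * Naive "last two labels" normalization: x.foo.com and y.foo.com both
     * map to foo.com, so they share one politeness queue.
     */
    public static String politenessKey(String urlString) throws Exception {
        String host = new URL(urlString).getHost().toLowerCase();
        String[] labels = host.split("\\.");
        if (labels.length <= 2) {
            return host;
        }
        return labels[labels.length - 2] + "." + labels[labels.length - 1];
    }
}
```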
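On fetchlist building: for a constrained crawl, per-host queues emitted round-robin seem more appropriate than a randomized global ordering, since with only a handful of hosts a random spread still produces back-to-back requests to the same server. Again, this is a standalone sketch of the idea, not Nutch code.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/**
 * Sketch of a politeness-aware fetchlist: URLs are bucketed per host (or per
 * domain, using something like HostNormalizer above) and handed out
 * round-robin, so consecutive fetches rarely hit the same server.
 */
public class PerHostFetchlist {
    private final Map<String, Deque<String>> queues = new LinkedHashMap<>();

    public void add(String hostKey, String url) {
        queues.computeIfAbsent(hostKey, k -> new ArrayDeque<>()).add(url);
    }

    /** Emit URLs round-robin across hosts instead of in one global random order. */
    public List<String> buildFetchlist() {
        List<String> out = new ArrayList<>();
        boolean drained = false;
        while (!drained) {
            drained = true;
            for (Deque<String> q : queues.values()) {
                String next = q.poll();   // take at most one URL per host per pass
                if (next != null) {
                    out.add(next);
                    drained = false;
                }
            }
        }
        return out;
    }
}
```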
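On custom fetch output and post-fetch hooks: what I'd like is for fetch output to be an interface the application can swap out or leave empty. The names below are made up for illustration; the 404-checker and the keyword hook correspond to the two bullets above.

```java
/**
 * Sketch of pluggable post-fetch handling. Names are illustrative, not
 * Nutch's actual plugin points: the idea is that fetch output becomes an
 * interface the application can replace or leave as a no-op.
 */
interface FetchHandler {
    void handle(String url, int httpStatus, String content);
}

/** No-op output: only interested in whether pages are dead (404). */
class DeadLinkChecker implements FetchHandler {
    public void handle(String url, int httpStatus, String content) {
        if (httpStatus == 404) {
            System.out.println("DEAD: " + url);   // nothing is segmented or indexed
        }
    }
}

/** Post-fetch hook: flag pages containing a certain keyword. */
class KeywordHook implements FetchHandler {
    private final String keyword;
    KeywordHook(String keyword) { this.keyword = keyword; }

    public void handle(String url, int httpStatus, String content) {
        if (httpStatus == 200 && content != null && content.contains(keyword)) {
            // a real implementation would insert a row into a database table here
            System.out.println("MATCH: " + url);
        }
    }
}
```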
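And on URL-pattern-specific behavior, the simplest example is a per-URL parsing policy like this (again just a sketch of the decision, not of where it would plug into Nutch):

```java
/** Sketch: per-URL parsing policy, e.g. skip outlink extraction for query URLs. */
public class OutlinkPolicy {
    /** Returns true if outlinks should be extracted from this page. */
    public static boolean shouldExtractOutlinks(String url) {
        // dynamic pages with query strings often explode into near-duplicates,
        // so a constrained crawl may choose not to follow links from them
        return !url.contains("?");
    }
}
```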
Well, here's a start at least. I'm not whining, I'm just listing down some ideas for possible improvements.