Inside Our Crawler
Posted by Kelvin on 25 Aug 2005 at 09:11 pm | Tagged as: programming, Lucene / Solr / Elasticsearch / Nutch, crawling, work
Inspired by http://wiki.apache.org/nutch/DissectingTheNutchCrawler
I guess the contents of this post will eventually make it to javadocs. Note that Our Crawler (OC) is really not so different from Nutch Crawler (NC). This document highlights the main differences, as well as important classes.
CrawlTool
CrawlTool is the point of entry to OC. It doesn't do very much really, just calls Fetcher.run().
Fetcher
The Fetcher doesn't do very much either. It starts the FetcherThreads, and provides them SeedURLs from a CrawlSeedSource. Its 2 other main responsibilies are: distributing URLs amongst threads, and reporting FetcherStatus. In a minor design quirk, the PostFetchProcessor is also found in Fetcher and not in FetcherThread instances, thus imposing the requirement on PostFetchProcessors to be thread-safe. The rationale behind this, is to be able to write to a single Nutch segment, instead of requiring a x-way post-fetch segment merge where x is the number of fetcher threads (I haven't put much thought into this. If post-fetch processing is substantial, then it makes more sense to make this a per-thread thing)
FetcherThread
A FetcherThread has the following high-level responsibilities:
- Maintain FetchList and db of fetchedurls for a set of urls (as determined by Fetcher's url distribution strategy). FetcherThread is also responsible for avoiding duplicates between URLs (whether fetched, parsed or output)
- Delegate the actual URL downloading to Http and HttpResponse classes
- Process the outcome of downloading, which primarily has the following steps:
- Parse the HTML
- Extract links from parsed page and run each through FetchListScope, adding to relevant thread's link queue (for subsequent adding into fetchlist) if link is allowed
- Run fetch output through PostFetchScope, passing to PostFetchProcessor if allowed
FetcherThread periodically moves items from its link queue to its fetchlist. The link queue is the thread-safe holding area where other fetcher threads add URLs to be fetched.
A note about the PostFetchProcessor : I know it sounds weird :-).
Initially I called it something like PageOutputter or something, which leaves an equally bad taste in my mouth. I wanted to capture the idea of something that happens _after_ a url/page is downloaded, whether saving to Nutch segment, sending an email to someone every 100 downloaded pages or triggering a backup.
FetchList
Now on to the fetchlist system, one of the biggest differences between NC and OC. The main actors in are FetchList and HostQueue. Unlike Nutch where a FetchList is a sequence of URLs, our FetchList is a sequence of HostQueues, which are in turn a sequence of URLs with the same host. The FetchList also manages the server politeness policy (unlike Nutch where this is done in the Http class).
Different implementations of FetchList may choose different strategies of how to go about prioritizing certain hosts/URLs over others. This process can even be randomized (in which case, it somewhat simulates the Nutch fetchlist, although Nutch's fetchlist is deterministic and OC's DefaultFetchList is NOT).
LastModifiedDB and FetchedURLs
These are both new to OC, even though the WebDB serves as roughly the equivalent. Javadocs should be sufficient for understanding the roles of these 2 classes. One thing to note about FetchedURLs: to save space, the current implementation does _not_ save the actual URL, but rather a 64-bit checksum.
Scopes and Filters
A Scope consists of zero or more filters, where given an input, each filter replies ALLOW, REJECT or ABSTAIN. Self-explanatory. When all filters in a scope abstain, then the scope's allowByDefault value kicks in (also used when a scope has no filters).
The different scopes in use are: FetchListScope, ParseScope and PostFetchScope.
Comments Off on Inside Our Crawler