Supermind Search Consulting Blog 
Solr - Elasticsearch - Big Data

Posts about crawling

Our Crawler Todo List

Posted by Kelvin on 25 Aug 2005 | Tagged as: Lucene / Solr / Elasticsearch / Nutch, crawling, programming

In the order in which these are jumping off my brain (read: no order whatsoever):

  • If-Modified-Since
    OC already has the basic infrastructure in place for conditional downloading of pages based on the If-Modified-Since HTTP header. What remains is to implement it in the Http and HttpResponse classes.
  • Re-use socket connections even for redirects
    Right now, redirects always open a new socket connection, even if the redirected URL is on the same host as the original. There's no reason why this should be so.
  • URL duplication is only checked against a fetcherthread's unfetched URLs and its already-fetched URLs, but not against the URLs currently being fetched. This gap allows duplicates to sneak in.
  • Customize per-host crawling params
    If host parameters such as HTTP version, response time and number of pages can be captured and stored, the crawler might be able to modify its interaction with the server accordingly, for example by pipelining fewer requests for slower hosts.
  • Use Commons HttpClient as alternative HTTP implementation
    Although HttpClient doesn't support pipelining, it does provide connection persistence by default. It would be interesting to compare the speed of this against the pipelined version. Furthermore, HttpClient is much more robust in handling the various nuances webservers tend to throw our way (Andrzej has documented these in the Nutch Jira).
  • Bot-trap detection
    As mentioned in Limitations of OC, bot-traps can be fatal for OC.
  • Checkpointing
    On long crawls, schedule a checkpoint every x minutes so that the crawler can be easily resumed from the last checkpoint in the case of a hang. I don't think this will be difficult to implement, if a bit routine (read: boring).
  • Runtime management (via JMX?)
    The Heritrix project has taken the JMX route in incorporating runtime management into their crawler. I don't know enough about JMX to decide if this is a good idea, but I have heard that JMX is icky. I suppose it's either JMX or client-server? These seem to be the only two options I have if I want to build a GUI for OC.
  • Implement ExtendibleHashFile and a fixed-length optimized version of SequenceFile/MapFile.
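
For the URL duplication item above, one way to close the gap is to track in-flight URLs as a third state and consult it whenever a new URL is offered. The following is only a minimal sketch: UrlFrontier, offer, next and markFetched are hypothetical names and not OC's actual classes, and it assumes each fetcherthread owns its own instance, so no locking is shown.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

/** Hypothetical per-thread URL frontier; an illustration, not OC's real code. */
final class UrlFrontier {
    private final Queue<String> unfetched = new ArrayDeque<>(); // fetch order
    private final Set<String> queued = new HashSet<>();         // waiting to be fetched
    private final Set<String> inFlight = new HashSet<>();       // currently being fetched
    private final Set<String> fetched = new HashSet<>();        // already fetched

    /** Queues the URL unless it is already queued, in flight, or fetched. */
    boolean offer(String url) {
        if (queued.contains(url) || inFlight.contains(url) || fetched.contains(url)) {
            return false;
        }
        queued.add(url);
        unfetched.add(url);
        return true;
    }

    /** Hands out the next URL and moves it to the in-flight set; null if empty. */
    String next() {
        String url = unfetched.poll();
        if (url != null) {
            queued.remove(url);
            inFlight.add(url);
        }
        return url;
    }

    /** Marks an in-flight URL as done so later offers are still rejected. */
    void markFetched(String url) {
        if (inFlight.remove(url)) {
            fetched.add(url);
        }
    }
}
```

A URL discovered while the same URL is still being downloaded is rejected by offer(), which is exactly the case the current check misses.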

Limitations of OC

Posted by Kelvin on 19 Aug 2005 | Tagged as: work, programming, Lucene / Solr / Elasticsearch / Nutch, crawling

Follow-up post to some Reflections on modifying the Nutch crawler.

The current implementation of Our Crawler (OC) has the following limitations:

  1. No support for distributed crawling.
    I neglected to mention that by building the fetchlist offline, the Nutch Crawler (NC) has an easier job splitting the crawling amongst different crawl servers. Furthermore, because the database of fetched URLs (WebDB) is also modified offline, it's really easy to check if a URL has already been fetched.

    Since both the fetchlist and the fetched URL DB are modified online in OC, the multiple crawl server scenario complicates things somewhat. My justification for omitting this, in the initial phase at least, is that for most constrained crawls, a single crawl server (with multiple threads) is the most probable use case, and also most likely sufficient.

  2. No load-balancing of URLs amongst threads.
    Currently URL distribution is excessively simple (url.getHost().hashCode() % numberOfThreads), and the only guarantee the Fetcher makes is that every URL from the same host will go to the same thread (an important guarantee, since each fetcherthread maintains its own fetchlist and database of fetched URLs). This also means that it's pointless using 20 threads to crawl 3 hosts (check out the previous post if you're not clear why).

    However, when some hosts are significantly larger than others, it's highly probable that the number of pages each thread has to fetch is uneven, resulting in sub-optimal fetch times. This calls for a more sophisticated method of assigning URLs to threads, while still maintaining the thread-host contract.

  3. No bot-trap detection.
    With NC, since the depth of the crawl (determined by the number of iterations the fetcher is run) is limited, bot-traps (intentional or otherwise) aren't a major concern.

    When letting OC loose on a site, however, bot-traps can be a big problem because OC will continue to run as long as there are still items in the fetchlist.

  4. Disk writes in multiple threads may nullify SequenceFile performance gains
    Each thread maintains its own database of fetched URLs (via a MapFile). When using a lot of threads, it's quite likely that multiple threads will be writing to disk at once. To be honest, I don't know enough about this to say whether it's a problem, but depending on hardware and HD utilization, I think it's possible that the disk heads may end up jumping around to satisfy the simultaneous writes. If so, SequenceFile's premise of fast sequential write-access is no longer valid.
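
For item 2, one possible way to balance load without breaking the thread-host contract is a sticky assignment: the first time a host is seen it goes to the least-loaded thread, and every later URL from that host follows it. This is a sketch under assumptions. HostAssigner, byHash and assign are hypothetical names, and "load" here is simply the number of URLs assigned so far, since the true size of a host is not known up front. byHash shows the hash scheme from the post using Math.floorMod, because in Java the % operator returns a negative result when hashCode() is negative (the real OC code may already handle this).

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical host-to-thread assigner; an illustration, not OC's real code. */
final class HostAssigner {
    private final Map<String, Integer> hostToThread = new HashMap<>();
    private final long[] load; // URLs assigned to each thread so far

    HostAssigner(int numberOfThreads) {
        this.load = new long[numberOfThreads];
    }

    /** Hash scheme from the post; floorMod keeps the index in [0, numberOfThreads). */
    static int byHash(String host, int numberOfThreads) {
        return Math.floorMod(host.hashCode(), numberOfThreads);
    }

    /** Sticky assignment: a new host goes to the least-loaded thread and stays there. */
    synchronized int assign(String host) {
        Integer known = hostToThread.get(host);
        int thread;
        if (known != null) {
            thread = known; // thread-host contract: same host, same thread
        } else {
            thread = 0;
            for (int i = 1; i < load.length; i++) {
                if (load[i] < load[thread]) {
                    thread = i;
                }
            }
            hostToThread.put(host, thread);
        }
        load[thread]++;
        return thread;
    }
}
```

The trade-off is that the assigner now needs shared state (the host map), whereas the hash scheme is stateless. It also cannot move a host once assigned, so a host that turns out to be huge still ties up one thread.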

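For item 3, a crude guard is to give OC the same kind of bound that NC gets for free from its limited number of iterations. The sketch below caps link depth and, as an extra assumption that is not from the post, the number of pages accepted per host. CrawlBudget and admit are hypothetical names, and real bot-trap detection would likely need more than this.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical crawl budget; a blunt guard against bot-traps, not real detection. */
final class CrawlBudget {
    private final int maxDepth;        // maximum link depth from the seed URL
    private final int maxPagesPerHost; // assumed per-host cap, not part of OC
    private final Map<String, Integer> pagesPerHost = new HashMap<>();

    CrawlBudget(int maxDepth, int maxPagesPerHost) {
        this.maxDepth = maxDepth;
        this.maxPagesPerHost = maxPagesPerHost;
    }

    /** Returns true if the URL may be queued, and counts it against its host. */
    boolean admit(String host, int depth) {
        if (depth > maxDepth) {
            return false; // too deep: possibly an endless chain of generated links
        }
        int seen = pagesPerHost.getOrDefault(host, 0);
        if (seen >= maxPagesPerHost) {
            return false; // host has used up its page budget
        }
        pagesPerHost.put(host, seen + 1);
        return true;
    }
}
```

With a budget like this the fetchlist eventually drains even on a trap, at the cost of possibly cutting off legitimately deep or large sites.
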
These are what I can think of right now. I will update this list as I go along.
