Supermind Search Consulting Blog 
Solr - Elasticsearch - Big Data

Posts about Lucene / Solr / Elasticsearch / Nutch

OC and Nutch MapReduce

Posted by Kelvin on 15 Sep 2005 | Tagged as: Lucene / Solr / Elasticsearch / Nutch, crawling, work, programming

(or what's next for OC)…

I've received a couple of emails asking about the future of OC vis-à-vis incorporation into the Nutch codebase, the upcoming MapReduce merge into trunk, etc.

My thoughts are:

  1. When MapReduce is merged into trunk, I'll make appropriate changes to OC to support MapReduce.
  2. This MapReduce-compatible OC will be offered to the Nutch codebase. As of today, I've removed most usages of JDK 1.5 features, so the other thing that needs to be removed is the Spring Framework dependency.
  3. I _might_ keep a version of OC which is more experimental and uses Spring, and can operate on a more standalone basis. The advantages (more autonomy over what I can do and not, Spring support) will have to be balanced against the disadvantages (duplication).

27092005 edit
I'm browsing the MapReduce sources now, and I've found that the complexity of the Fetcher has increased quite significantly. I'm leaning towards maintaining a simple non-mapred fetcher for folks who don't need MapReduce's scalability.

Inside Our Crawler

Posted by Kelvin on 25 Aug 2005 | Tagged as: work, programming, Lucene / Solr / Elasticsearch / Nutch, crawling

Inspired by http://wiki.apache.org/nutch/DissectingTheNutchCrawler

I guess the contents of this post will eventually make it to javadocs. Note that Our Crawler (OC) is really not so different from Nutch Crawler (NC). This document highlights the main differences, as well as important classes.

CrawlTool
CrawlTool is the point of entry to OC. It doesn't do very much, really: it just calls Fetcher.run().

Fetcher
The Fetcher doesn't do very much either. It starts the FetcherThreads and provides them SeedURLs from a CrawlSeedSource. Its two other main responsibilities are distributing URLs amongst threads and reporting FetcherStatus. In a minor design quirk, the PostFetchProcessor lives in the Fetcher and not in the FetcherThread instances, which imposes the requirement that PostFetchProcessors be thread-safe. The rationale is to be able to write to a single Nutch segment, instead of requiring an x-way post-fetch segment merge, where x is the number of fetcher threads. (I haven't put much thought into this; if post-fetch processing is substantial, it makes more sense to make this a per-thread thing.)

FetcherThread
A FetcherThread has the following high-level responsibilities:

  1. Maintain the FetchList and database of fetched URLs for a set of URLs (as determined by the Fetcher's URL distribution strategy). FetcherThread is also responsible for avoiding duplicate URLs (whether fetched, parsed or output).
  2. Delegate the actual URL downloading to Http and HttpResponse classes
  3. Process the outcome of downloading, which primarily has the following steps:
    1. Parse the HTML
    2. Extract links from parsed page and run each through FetchListScope, adding to relevant thread's link queue (for subsequent adding into fetchlist) if link is allowed
    3. Run fetch output through PostFetchScope, passing to PostFetchProcessor if allowed

FetcherThread periodically moves items from its link queue to its fetchlist. The link queue is the thread-safe holding area where other fetcher threads add URLs to be fetched.
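
To make the hand-off concrete, here's a rough sketch of the link queue idea; the class and method names are illustrative, not OC's actual API:

    import java.util.ArrayList;
    import java.util.LinkedList;
    import java.util.List;
    import java.util.Queue;

    // Illustrative sketch only: a thread-safe link queue that other fetcher
    // threads add to, periodically drained into this thread's private fetchlist.
    public class LinkQueueSketch {
        // Holding area; synchronized because *other* threads enqueue into it.
        private final Queue<String> linkQueue = new LinkedList<String>();
        // Private to the owning thread, so no synchronization needed here.
        private final List<String> fetchlist = new ArrayList<String>();

        // Called by any fetcher thread whose FetchListScope allowed the link.
        public void enqueueLink(String url) {
            synchronized (linkQueue) {
                linkQueue.add(url);
            }
        }

        // Called periodically by the owning thread between fetches.
        public void moveLinksToFetchlist() {
            List<String> drained = new ArrayList<String>();
            synchronized (linkQueue) {
                drained.addAll(linkQueue);
                linkQueue.clear();
            }
            fetchlist.addAll(drained); // the real thing would also de-duplicate here
        }
    }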

A note about the PostFetchProcessor: I know the name sounds weird :-).
I initially called it something like PageOutputter, which left an equally bad taste in my mouth. I wanted to capture the idea of something that happens _after_ a URL/page is downloaded, whether that's saving to a Nutch segment, sending an email every 100 downloaded pages, or triggering a backup.
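
For concreteness, the shape of the thing is roughly this (the method signature here is a guess for illustration, not OC's actual interface):

    // Hypothetical shape of a post-fetch hook; OC's real interface differs in details.
    public interface PostFetchProcessor {
        // Called once per downloaded page. Implementations might write to a
        // Nutch segment, send an email every 100 pages, trigger a backup, etc.
        // Must be thread-safe, since a single instance is shared by all
        // FetcherThreads via the Fetcher.
        void process(String url, byte[] content) throws Exception;
    }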

FetchList
Now on to the fetchlist system, one of the biggest differences between NC and OC. The main actors here are FetchList and HostQueue. Unlike Nutch, where a FetchList is a sequence of URLs, our FetchList is a sequence of HostQueues, which are in turn sequences of URLs from the same host. The FetchList also manages the server politeness policy (unlike Nutch, where this is done in the Http class).

Different implementations of FetchList may choose different strategies for prioritizing certain hosts/URLs over others. This process can even be randomized (in which case it somewhat simulates the Nutch fetchlist, although Nutch's fetchlist is deterministic and OC's DefaultFetchList is NOT).
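
Here's a much-simplified sketch of the idea: the fetchlist holds one queue per host, politeness is enforced per host, and the default prioritization picks the eligible host with the most unfetched URLs. The code is illustrative only, not OC's DefaultFetchList:

    import java.util.HashMap;
    import java.util.LinkedList;
    import java.util.Map;
    import java.util.Queue;

    // Simplified model of a fetchlist as a collection of per-host queues.
    // Class names mirror the text; the logic is illustrative, not OC's code.
    public class HostQueueFetchListSketch {
        private static final long SERVER_DELAY_MS = 1000; // politeness delay

        private static class HostQueue {
            final Queue<String> urls = new LinkedList<String>();
            long earliestNextFetch = 0; // when this host may be hit again
        }

        private final Map<String, HostQueue> hosts = new HashMap<String, HostQueue>();

        public void add(String host, String url) {
            HostQueue q = hosts.get(host);
            if (q == null) {
                q = new HostQueue();
                hosts.put(host, q);
            }
            q.urls.add(url);
        }

        // Pick the eligible host with the most unfetched URLs, mark it busy
        // for SERVER_DELAY_MS, and return its next URL (null if none eligible).
        public String next() {
            long now = System.currentTimeMillis();
            HostQueue best = null;
            for (HostQueue q : hosts.values()) {
                if (q.urls.isEmpty() || q.earliestNextFetch > now) continue;
                if (best == null || q.urls.size() > best.urls.size()) best = q;
            }
            if (best == null) return null;
            best.earliestNextFetch = now + SERVER_DELAY_MS;
            return best.urls.poll();
        }
    }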

LastModifiedDB and FetchedURLs
These are both new to OC, even though Nutch's WebDB serves as a rough equivalent. The javadocs should be sufficient for understanding the roles of these two classes. One thing to note about FetchedURLs: to save space, the current implementation does _not_ save the actual URL, but rather a 64-bit checksum.
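
As an illustration of the checksum idea (the actual hash function isn't spelled out here, so this sketch substitutes a 64-bit FNV-1a hash, and an in-memory set stands in for the on-disk store):

    import java.util.HashSet;
    import java.util.Set;

    // Illustrative only: remember fetched URLs by a 64-bit checksum instead of
    // the full string. The hash function (FNV-1a) is a stand-in, and the
    // in-memory Set stands in for OC's persistent structure.
    public class FetchedUrlsSketch {
        private final Set<Long> checksums = new HashSet<Long>();

        // Returns true if the URL had not been seen before.
        public boolean addIfAbsent(String url) {
            return checksums.add(fnv1a64(url));
        }

        private static long fnv1a64(String s) {
            long hash = 0xcbf29ce484222325L;
            for (int i = 0; i < s.length(); i++) {
                hash ^= s.charAt(i);
                hash *= 0x100000001b3L;
            }
            return hash;
        }
    }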

Scopes and Filters
A Scope consists of zero or more filters; given an input, each filter replies ALLOW, REJECT or ABSTAIN. Self-explanatory. When all filters in a scope abstain (or the scope has no filters), the scope's allowByDefault value kicks in.

The different scopes in use are: FetchListScope, ParseScope and PostFetchScope.
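
A sketch of the scope/filter contract, with one plausible evaluation order (first non-abstaining filter wins); the exact semantics in OC may differ:

    import java.util.ArrayList;
    import java.util.List;

    // Illustrative sketch of the scope/filter contract described above;
    // names follow the text but the code is not OC's actual implementation.
    public class ScopeSketch {
        public enum Verdict { ALLOW, REJECT, ABSTAIN }

        public interface Filter {
            Verdict check(String url);
        }

        private final List<Filter> filters = new ArrayList<Filter>();
        private final boolean allowByDefault;

        public ScopeSketch(boolean allowByDefault) {
            this.allowByDefault = allowByDefault;
        }

        public void addFilter(Filter f) {
            filters.add(f);
        }

        // First non-abstaining filter decides; if every filter abstains
        // (or there are no filters), fall back to allowByDefault.
        public boolean isAllowed(String url) {
            for (Filter f : filters) {
                Verdict v = f.check(url);
                if (v == Verdict.ALLOW) return true;
                if (v == Verdict.REJECT) return false;
            }
            return allowByDefault;
        }
    }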

Our Crawler Todo List

Posted by Kelvin on 25 Aug 2005 | Tagged as: programming, Lucene / Solr / Elasticsearch / Nutch, crawling

In the order in which these are jumping off my brain (read: no order whatsoever):

  • If-modified-since
    OC already has the basic infrastructure in place to implement conditional downloading of pages based on the If-Modified-Since HTTP header. This just needs to be implemented in the Http and HttpResponse classes (see the sketch after this list).
  • Re-use socket connections even for redirects
    Right now, redirects always open new socket connections, even if the redirected URL is on the same host as the original. There's no reason why this should be so.
  • URL duplication is only checked against a fetcher thread's unfetched URLs and its already-fetched URLs, not against the URLs currently being fetched, so duplicates can still sneak in.
  • Customize per-host crawling params
    If host parameters such as HTTP version, response time and number of pages can be captured and stored, the crawler might be able to modify its interaction with the server accordingly, for example by pipelining fewer requests to slower hosts.
  • Use Commons HttpClient as alternative HTTP implementation
    Although HttpClient doesn't support pipelining, it does provide connection persistence by default. It would be interesting to compare speed differences between this and the pipelined version. Furthermore, HttpClient is much more robust in handling the various nuances web servers tend to throw our way (Andrzej has documented these in the Nutch Jira).
  • Bot-trap detection
    As mentioned in Limitations of OC, bot-traps can be fatal for OC.
  • Checkpointing
    On long crawls, schedule a checkpoint every x minutes so that the crawler can be easily resumed from the last checkpoint in the case of a hang. I don't think this will be difficult to implement, if a bit routine (read: boring).
  • Runtime management (via JMX?)
    The Heritrix project has taken the JMX route in incorporating runtime management into their crawler. I don't know enough about JMX to decide if this is a good idea, but I have heard that JMX is icky. I suppose it's either JMX or a client-server protocol? Those seem to be the only two options I have if I want to build a GUI for OC.
  • Implement ExtendibleHashFile and a fixed-length optimized version of SequenceFile/MapFile.
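
Regarding the first item, the gist of conditional downloading looks something like the following. This sketch uses the stock HttpURLConnection purely for illustration; in OC it would live in the Http and HttpResponse classes:

    import java.net.HttpURLConnection;
    import java.net.URL;

    // Illustration of conditional downloading with If-Modified-Since, using
    // HttpURLConnection rather than OC's Http/HttpResponse classes.
    public class IfModifiedSinceSketch {
        // lastModified is the timestamp previously stored for this URL (millis
        // since epoch); returns null when the server says 304 Not Modified.
        public static byte[] fetchIfModified(String url, long lastModified) throws Exception {
            HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
            if (lastModified > 0) {
                conn.setIfModifiedSince(lastModified); // sends If-Modified-Since header
            }
            int status = conn.getResponseCode();
            if (status == HttpURLConnection.HTTP_NOT_MODIFIED) {
                return null; // page unchanged since last crawl; skip re-processing
            }
            java.io.InputStream in = conn.getInputStream();
            java.io.ByteArrayOutputStream out = new java.io.ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) out.write(buf, 0, n);
            in.close();
            return out.toByteArray();
        }
    }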

Limitations of OC

Posted by Kelvin on 19 Aug 2005 | Tagged as: crawling, work, programming, Lucene / Solr / Elasticsearch / Nutch

Follow-up post to some Reflections on modifying the Nutch crawler.

The current implementation of Our Crawler (OC) has the following limitations:

  1. No support for distributed crawling.
    I neglected to mention that by building the fetchlist offline, the Nutch Crawler (NC) has an easier job splitting the crawling amongst different crawl servers. Furthermore, because the database of fetched URLs (WebDB) is also modified offline, it's really easy to check if a URL has already been fetched.

    Since both the fetchlist and the fetched-URL DB are maintained online in OC, the multiple-crawl-server scenario complicates things somewhat. My justification for omitting this, in the initial phase at least, is that for most constrained crawls a single crawl server (with multiple threads) is the most probable use case, and most likely sufficient.

  2. No load-balancing of URLs amongst threads.
    Currently URL distribution is excessively simple (url.getHost().hashCode() % numberOfThreads; see the sketch after this list), and the only guarantee the Fetcher makes is that every URL from the same host will go to the same thread (an important guarantee, since each fetcher thread maintains its own fetchlist and database of fetched URLs). This also means that it's pointless using 20 threads to crawl 3 hosts (check out the previous post if you're not clear why).

    However, when some hosts are significantly larger than others, it's highly probable that the number of pages each thread has to fetch will be uneven, resulting in sub-optimal fetch times. This calls for a more sophisticated method of assigning URLs to threads, one that still maintains the thread-host contract.

  3. No bot-trap detection.
    With NC, since the depth of the crawl (determined by the number of iterations the fetcher is run) is limited, bot-traps (intentional or otherwise) aren't a major concern.

    When letting OC loose on a site, however, bot-traps can be a big problem because OC will continue to run as long as there are still items in the fetchlist.

  4. Disk writes in multiple threads may nullify SequenceFile performance gains
    Each thread maintains its own database of fetched URLs (via a MapFile). When using a lot of threads, it's quite likely that multiple threads will be writing to disk at once. To be honest, I don't know enough about this to say whether it's a problem, but depending on hardware and HD utilization, I think it's possible that the disk heads may end up jumping around to satisfy the simultaneous writes. If so, SequenceFile's premise of fast sequential write access is no longer valid.
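
For point 2, the current distribution scheme boils down to something like this (illustrative; the guard against negative hashCode values is one way of handling a detail I haven't discussed):

    import java.net.URL;

    // Sketch of the simple URL-to-thread distribution described in point 2.
    // The expression comes from the post; the negative-hashCode guard is an
    // added assumption for illustration.
    public class UrlDistributionSketch {
        public static int threadFor(String url, int numberOfThreads) throws Exception {
            String host = new URL(url).getHost();
            int bucket = host.hashCode() % numberOfThreads;
            if (bucket < 0) bucket += numberOfThreads; // hashCode() can be negative
            return bucket;
        }

        public static void main(String[] args) throws Exception {
            // All URLs from the same host land on the same thread...
            System.out.println(threadFor("http://example.com/a", 20));
            System.out.println(threadFor("http://example.com/b", 20));
            // ...so with only 3 hosts, at most 3 of 20 threads ever get work.
        }
    }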

This is what I can think of right now. I'll update this list as I go along.

Reflections on modifying the Nutch crawler

Posted by Kelvin on 16 Aug 2005 | Tagged as: work, programming, Lucene / Solr / Elasticsearch / Nutch

The code for my modifications to the Nutch-based fetcher is at alpha quality, meaning that it compiles and I've tested it on smallish (<100 pages) sites. There are some unit tests, but they're not as extensive as I'd like.

Some thoughts:

  1. Nutch crawler (NC) uses an offline fetchlist building strategy, whilst our crawler (OC) supports both online and offline fetchlist building. Fetchlist building is the process of determining which URLs to crawl next. In Nutch, it basically goes like this:
    1. Inject seed urls into database of urls
    2. Generate fetchlist
    3. Fetch urls in fetchlist
    4. Parse fetched pages for links, adding them into database of urls selectively (depending on url filter rules)
    5. Rank fetched pages (based on PageRank)
    6. Generate fetchlist using top x pages
    7. Repeat steps 3 – 6

    Since step 5 is both time-consuming and computationally-intensive, NC fetchlist building has to be an offline step. (NB: Breadth-first crawling can possibly be just as effective as PageRank-based crawling).

    In crawls where the number of fetched pages is limited and the order in which the pages are fetched doesn't matter, an offline fetchlist building strategy offers little or no advantage over an online one.

  2. Crawlers should be polite and not hammer a web server repeatedly with requests, so the crawler has to implement some kind of server politeness policy. The accepted norm is to wait a period of time between consecutive requests to the same server. To maximize crawl speed (and be polite), the crawler should try to download from as many different hosts at the same time as possible.

    Both of Nutch's current Http plugins (protocol-http and protocol-httpclient) are implemented in such a way that optimum crawl times can only be achieved when

    • the order in which the URLs are downloaded is sufficiently random
    • the set of hosts is sufficiently large
    • the set of hosts is well distributed

    In NC, this currently consists of:
    1. Generating an MD5 hash of each URL, then sorting fetchlist entries by this hash. Since a single character change in the url will result in a wildly different hash, this approach does a good job of satisfying the requirement that URLs be randomized. The other 2 requirements (large and well-distributed set of hosts) are satisfied by the very definition of the WWW.
    2. Enforcing a fixed delay between consecutive requests to the same webserver (this is the fetcher.server.delay config variable).

    In the case of constrained crawling, when either the size or the distribution requirement for hosts is not satisfied, it's not difficult to see how crawl time suffers: if there are too few hosts, or if the distribution of URLs is skewed towards a couple of large hosts, the crawl time will always be grossly sub-optimal.

    OC's default fetchlist prioritizes hosts based on the number of unfetched pages they contain. This satisfies the prerequisite for optimal crawl speed, and minimizes the chance of ending the crawl with a backlog of URLs from large hosts, where the number of effective fetcher threads drops to the number of outstanding hosts. OC also waits a period of time between consecutive requests, and it's possible to vary this delay depending on the time taken to download a URL (a slow server can mean an overloaded server, or it can mean a slow network); see the delay sketch at the end of this post.

    Note that OC's fetchlist is an interface, which means that Nutch's offline fetchlist building can be replicated by:
    1. Seed crawl from file (same as nutch)
    2. When crawl finishes, run Nutch fetchlist building steps
    3. Implement a simple fetchlist which reads entries from a Nutch fetchlist (.fl)
    4. Repeat steps 2 and 3.

    Actually, the current codebase already contains the necessary classes to do this.

  3. Java needs a library of external memory algorithms and data structures (I'm working on this now). Examples from Nutch are SequenceFile, MapFile and ArrayFile, as well as the external mergesort algorithm. Other examples are B+-tree, insertion sort for nearly sorted files, buffer tree, heap file, etc. The list goes on.
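
As an aside on point 2's variable politeness delay: one simple way to scale the delay with observed download time might look like this (the multiplier and bounds are made-up numbers, not OC's actual values):

    // Illustrative adaptive politeness policy: wait longer between requests to
    // hosts that respond slowly. The constants here are assumptions.
    public class AdaptiveDelaySketch {
        private static final long MIN_DELAY_MS = 1000;   // floor, like fetcher.server.delay
        private static final long MAX_DELAY_MS = 30000;  // cap so one slow host can't stall a thread
        private static final double SLOWNESS_FACTOR = 2.0;

        // lastDownloadMillis = how long the previous fetch from this host took.
        public static long nextDelay(long lastDownloadMillis) {
            long delay = (long) (lastDownloadMillis * SLOWNESS_FACTOR);
            return Math.max(MIN_DELAY_MS, Math.min(MAX_DELAY_MS, delay));
        }
    }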

MARC: msg '[Nutch-dev] Alternatives to SequenceFile for WebDB?'

Posted by Kelvin on 08 Aug 2005 | Tagged as: blogmark, Lucene / Solr / Elasticsearch / Nutch

http://marc.theaimsgroup.com/?l=nutch-developers&m=109958192318128&w=2
Mailing list thread on the various data structure alternatives to SequenceFile

(Via MARC: Mailing list ARChives)

Improving Nutch for constrained crawls

Posted by Kelvin on 18 Jul 2005 | Tagged as: work, programming, Lucene / Solr / Elasticsearch / Nutch

A heads-up to anyone who is using Nutch for anything but whole-web crawling…

I'm about 70% through patching Nutch to be more usable for constrained crawling. It is basically a partial re-write of the whole fetching mechanism, borrowing large chunks of code here and there.

Features include:
– Customizable seed inputs, i.e. seed a crawl from a file, database, Nutch FetchList, etc
– Customizable crawl scopes, e.g. crawl the seed URLs and only the URLs within their domains (this can already be accomplished manually with RegexURLFilter, but what if there are 200,000 seed URLs?), or crawl the seed URL domains plus one external link (not possible with the current filter mechanism)
– Online fetchlist building (as opposed to Nutch's offline method), and customizable strategies for building a fetchlist. The default implementation gives priority to hosts with a larger number of pages to crawl. Note that offline fetchlist building is ok too.
– Customizable fetch output mechanisms, like output to file, to segment, or even not at all (if we're just implementing a link-checker, for example)
– Fully utilizes HTTP 1.1 features (see below)

My aim is to be fully compatible with Nutch as it is, i.e. given a Nutch fetchlist, the new crawler can produce a Nutch segment. However, if you don't need that at all, and are just interested in Nutch as a crawler, then that's ok too!

Nutch and HTTP 1.1
Nutch, as it is, does not utilize HTTP 1.1's connection persistence and request pipelining features, which would significantly cut down crawling time when crawling a small number of hosts extensively.

To test this hypothesis, I modified Http and HttpResponse to use HTTP 1.1 features and fetched 20 URLs from the same host repeatedly (15 requests were pipelined per socket connection, which required a total of 2 separate connections to the server). I repeated this test with Nutch's Http (protocol-http). After a couple of runs, I was convinced that, at least in my test case, the optimized Http class was between 3 and 4 times faster than Nutch's version.

I believe the reasons for this speed-up are:
1. Reduced connection overhead (2 socket connections vs 20)
2. Reduced wait-time between requests (once vs 19 times). This is the fetcher.server.delay value in NutchConf.
3. Request pipelining (not waiting for one request's response before sending the next request)

Note that this is a contrived example, of course, but not an improbable one when running crawls of large domains with a large number of fetcher threads.
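
For the curious, the pipelining approach boils down to writing all the requests up front on one persistent connection and then reading the responses back in order. The sketch below is a bare-bones illustration using raw sockets, not the actual Http/HttpResponse code; it only handles Content-Length bodies and ignores chunked encoding, errors and request limits:

    import java.io.InputStream;
    import java.io.OutputStream;
    import java.net.Socket;

    // Bare-bones illustration of HTTP/1.1 request pipelining.
    public class PipeliningSketch {
        public static void main(String[] args) throws Exception {
            String host = "example.com";
            String[] paths = { "/", "/a.html", "/b.html" };

            Socket socket = new Socket(host, 80);
            OutputStream out = socket.getOutputStream();

            // 1. Send every request before reading any response.
            for (String path : paths) {
                String req = "GET " + path + " HTTP/1.1\r\n"
                           + "Host: " + host + "\r\n"
                           + "Connection: keep-alive\r\n\r\n";
                out.write(req.getBytes("US-ASCII"));
            }
            out.flush();

            // 2. Read the responses back in the same order.
            InputStream in = socket.getInputStream();
            for (int i = 0; i < paths.length; i++) {
                int contentLength = readHeaders(in);
                skip(in, contentLength); // a crawler would buffer this instead
                System.out.println(paths[i] + ": read " + contentLength + " body bytes");
            }
            socket.close();
        }

        // Reads the status line and headers, returns Content-Length (0 if absent).
        private static int readHeaders(InputStream in) throws Exception {
            int contentLength = 0;
            String line;
            while ((line = readLine(in)).length() > 0) {
                if (line.toLowerCase().startsWith("content-length:")) {
                    contentLength = Integer.parseInt(line.substring(15).trim());
                }
            }
            return contentLength;
        }

        private static String readLine(InputStream in) throws Exception {
            StringBuilder sb = new StringBuilder();
            int c;
            while ((c = in.read()) != -1 && c != '\n') {
                if (c != '\r') sb.append((char) c);
            }
            return sb.toString();
        }

        private static void skip(InputStream in, int n) throws Exception {
            for (int i = 0; i < n; i++) in.read();
        }
    }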

I've still a pretty long way to go in terms of testing, but I think a good portion of the difficult bits have already been ironed out. When I'm satisfied that it's stable, I'll release a version for download and feedback.

16082005 Update
You might be interested in Reflections on modifying the Nutch crawler

Useful links on HTTP Pipelining

Posted by Kelvin on 12 Jul 2005 | Tagged as: programming, Lucene / Solr / Elasticsearch / Nutch

Am implementing HTTP Pipelining for Nutch. Some useful links:

http://www.oaklandsoftware.com/product_16compare.html
http://www.mozilla.org/projects/netlib/http/pipelining-faq.html
http://www.w3.org/Protocols/HTTP/Performance/Pipeline.html
http://www.jmarshall.com/easy/http/

Update
Some thoughts working on HTTP pipelining:

It's a pity Commons HttpClient doesn't implement HTTP request pipelining. I tried my hand at it, but gave up after 2 hours. In some ways, Commons HttpClient seems a bit of an overkill, with all its glorious complexity compared to Nutch's Http and HttpResponse classes. OTOH, there's proper cookie and authentication support, amongst a host of other features. Innovation.ch's HttpClient provides request pipelining and most of Commons HttpClient's features, but it isn't actively developed anymore, and I'm a bit hesitant about using it. Nonetheless, I have to say its code is very well documented!

I spent about 3 hours knocking my head against the wall on a bug before I realized that mixing HTTP 1.0 and 1.1 servers is quite unhealthy when attempting request pipelining. I also had to remember that HTTP 1.0 doesn't support Keep-Alive (aka connection persistence) by default. Furthermore, sending the header Connection: Keep-Alive can cause HTTP 1.0 proxies to fail unpredictably. I decided to steer well clear of the mess and simply _not_ pipeline or reuse HTTP 1.0 sockets.
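
The rule I ended up with boils down to a check on the response status line, roughly like this (simplified; a real implementation would also look at the Connection and Proxy-Connection headers):

    // Simplified version of the "don't pipeline or reuse HTTP/1.0 sockets" rule.
    // Only the status line is inspected here.
    public class Http10GuardSketch {
        // statusLine is the first line of the response, e.g. "HTTP/1.1 200 OK".
        public static boolean mayReuseConnection(String statusLine) {
            return statusLine != null && statusLine.startsWith("HTTP/1.1");
        }

        public static void main(String[] args) {
            System.out.println(mayReuseConnection("HTTP/1.1 200 OK"));  // true: pipeline/reuse
            System.out.println(mayReuseConnection("HTTP/1.0 200 OK"));  // false: one request per socket
        }
    }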

Using Nutch for constrained crawling

Posted by Kelvin on 08 Jul 2005 | Tagged as: programming, Lucene / Solr / Elasticsearch / Nutch

My experience with Nutch so far is that it's not ideal for crawling, say, 50,000 specific sites, as opposed to whole-web crawling.

Here's why I think so:

  1. No built-in interfaces for multiple seed URL sources. The only ways to input seed URLs are via a text file and a downloaded DMOZ dump. Extra steps have to be taken to, say, dump a database table to text, rather than feeding Nutch directly from the database. Of course, this is a very minor issue.
  2. Not possible (without major hacking) to implement crawl scopes such as: crawl this entire domain, or crawl this entire domain plus one link outside the domain.
  3. Needs patching to change the definition of a "host". This is important because some sites with different subdomains are really served from the same server. For example, x.foo.com and y.foo.com are considered different hosts in Nutch, but they may very well be served from the same server; by making Nutch group them by their top-level domain, crawling will be more polite to that server (see the sketch after this list).
  4. No easy way to change fetchlist building strategy, which is currently optimized for whole-web crawling. This makes the assumption that there is a large distribution of distinct hosts, and hence a randomized spread of URLs is sufficient for being polite to servers (in not hammering them). This, however, is not the case with site-specific crawls (especially in crawls where pages outside of seed domains are not crawled). As far as I can tell, fetching also does not take advantage of HTTP 1.1's ability to download multiple pages per connection.
  5. Needs patching to support custom fetch output (or none at all). For example, an application that simply wants to check whether pages are 404ed would implement a no-op fetch output.
  6. Needs patching to support post-fetch (non-output) operations. For example, if a fetched page contains a certain keyword, add an entry to database table.
  7. No URL-pattern- or domain-specific operations, such as: if the URL contains "?", don't parse the page for outlinks.
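
On point 3, a naive way to group subdomains under one politeness key might look like the sketch below. It just keeps the last two labels of the hostname, which deliberately ignores multi-part public suffixes like .co.uk; a real implementation would need a suffix list:

    // Naive illustration of point 3: group x.foo.com and y.foo.com under the
    // same politeness key by keeping only the last two labels of the hostname.
    public class HostKeySketch {
        public static String politenessKey(String host) {
            String[] labels = host.split("\\.");
            if (labels.length <= 2) return host;
            return labels[labels.length - 2] + "." + labels[labels.length - 1];
        }

        public static void main(String[] args) {
            System.out.println(politenessKey("x.foo.com")); // foo.com
            System.out.println(politenessKey("y.foo.com")); // foo.com (same server, same key)
        }
    }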

Well, here's a start at least. I'm not whining, just listing some ideas for possible improvements.

Tuning Lucene

Posted by Kelvin on 22 Mar 2005 | Tagged as: work, Lucene / Solr / Elasticsearch / Nutch

Recommended values for mergeFactor, minMergeDocs, maxMergeDocs
http://marc.theaimsgroup.com/?l=lucene-user&m=110235452004799&w=2

Best Practices for Distributing Lucene Indexing and Searching
http://marc.theaimsgroup.com/?l=lucene-user&m=110973989200204&w=2

Optimizing for long queries
http://marc.theaimsgroup.com/?l=lucene-user&m=108840979004587&w=2
http://marc.theaimsgroup.com/?l=lucene-user&m=110969921306938&w=2

ParallelMultiSearcher vs. one big index
http://marc.theaimsgroup.com/?l=lucene-user&m=110607674000954&w=2

Lucene Performance Bottlenecks
http://marc.theaimsgroup.com/?l=lucene-user&m=113352444722705&w=2
