Reflections on modifying the Nutch crawler
Posted by Kelvin on 16 Aug 2005 | Tagged as: work, programming, Lucene / Solr / Elasticsearch / Nutch
The code for my modifications to the Nutch-based fetcher is at alpha quality, meaning that it compiles, and I've tested it on smallish (<100 pages) sites. There are some unit tests written, but not as extensive as I'd like.
Some thoughts:
- Nutch crawler (NC) uses an offline fetchlist-building strategy, while our crawler (OC) supports both online and offline fetchlist building. Fetchlist building is the process of determining which URLs to crawl next. In Nutch, it goes roughly like this:
1. Inject seed urls into database of urls
2. Generate fetchlist
3. Fetch urls in fetchlist
4. Parse fetched pages for links, adding them into database of urls selectively (depending on url filter rules)
5. Rank fetched pages (based on PageRank)
6. Generate fetchlist using top x pages
7. Repeat steps 3 – 6
Since step 5 is both time-consuming and computationally intensive, NC's fetchlist building has to be an offline step. (NB: breadth-first crawling can possibly be just as effective as PageRank-based crawling.)
In crawls where the number of fetched pages is limited and the order in which pages are fetched doesn't matter, an offline fetchlist-building strategy offers little or no advantage over an online one.
- Crawlers should be polite and not hammer a web server repeatedly with requests, so the crawler has to implement some kind of server politeness policy. The accepted norm is to wait a period of time between consecutive requests to the same server. To maximize crawl speed (and still be polite), the crawler should try to download from as many different hosts at the same time as possible.
Both of Nutch's current Http plugins (protocol-http and protocol-httpclient) are implemented in such a way that optimum crawl times can only be achieved when
- the order in which the URLs are downloaded is sufficiently random
- the set of hosts is sufficiently large
- the set of hosts is well distributed
In NC, this currently consists of:
1. Generating an MD5 hash of each URL, then sorting fetchlist entries by this hash. Since a single-character change in the URL results in a wildly different hash, this approach does a good job of satisfying the requirement that URLs be randomized (a tiny sketch of this follows the list). The other two requirements (a large and well-distributed set of hosts) are satisfied by the very nature of the WWW.
2. Enforcing a fixed delay between consecutive requests to the same web server (this is the fetcher.server.delay config variable).
In the case of constrained crawling, when either the size or the distribution of hosts falls short, it's not difficult to see how crawl time suffers: if there are too few hosts, or if the distribution of URLs is skewed towards a couple of large hosts, the crawl time will always be grossly sub-optimal.
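For illustration, here is a tiny sketch of the hash-and-sort idea from step 1, using nothing beyond the JDK; the class name is mine, not Nutch's:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.List;
import java.util.TreeMap;

/** Tiny sketch of randomizing a fetchlist by ordering URLs on their MD5 hash,
 *  so that consecutive entries rarely belong to the same host. */
public class Md5FetchlistOrder {
    public static void main(String[] args) throws NoSuchAlgorithmException {
        List<String> urls = Arrays.asList(
                "http://foo.com/a", "http://foo.com/b",
                "http://bar.org/x", "http://baz.net/y");

        MessageDigest md5 = MessageDigest.getInstance("MD5");
        // TreeMap keyed by the hex MD5 of each URL: iterating it yields hash
        // order, which scatters same-host URLs across the fetchlist.
        TreeMap<String, String> byHash = new TreeMap<>();
        for (String url : urls) {
            byHash.put(toHex(md5.digest(url.getBytes(StandardCharsets.UTF_8))), url);
        }
        byHash.values().forEach(System.out::println);
    }

    private static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) sb.append(String.format("%02x", b));
        return sb.toString();
    }
}
```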
OC's default fetchlist prioritizes hosts by the number of unfetched pages they contain. This satisfies the prerequisite for optimal crawl speed and minimizes the chance of "backed-up" URLs from large hosts, which take a long time to fetch because the number of active fetching threads dwindles to the number of outstanding hosts. OC also waits a period of time between consecutive requests to the same host, and it's possible to vary this delay based on the time taken to download a URL (a slow server can mean an overloaded server, or it can mean a slow network).
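Here is a minimal sketch of that kind of online, host-prioritized fetchlist. The class and method names are illustrative only, not OC's actual API:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

/** Hedged sketch of an online fetchlist that hands out URLs from the host
 *  with the most unfetched pages, while honouring a per-host delay.
 *  Names and structure are illustrative, not OC's actual API. */
public class HostPriorityFetchlist {

    private static final long SERVER_DELAY_MS = 5000;   // analogous to fetcher.server.delay

    private final Map<String, Deque<String>> queuesByHost = new HashMap<>();
    private final Map<String, Long> nextAllowedFetch = new HashMap<>();

    /** Queues a URL under its host (host extraction is left to the caller). */
    public synchronized void add(String host, String url) {
        queuesByHost.computeIfAbsent(host, h -> new ArrayDeque<>()).add(url);
    }

    /** Returns the next URL to fetch, preferring the eligible host with the
     *  most unfetched pages, or null if every host is still in its delay window. */
    public synchronized String next() {
        long now = System.currentTimeMillis();
        String bestHost = null;
        int bestSize = 0;
        for (Map.Entry<String, Deque<String>> e : queuesByHost.entrySet()) {
            boolean eligible = nextAllowedFetch.getOrDefault(e.getKey(), 0L) <= now;
            if (eligible && e.getValue().size() > bestSize) {
                bestHost = e.getKey();
                bestSize = e.getValue().size();
            }
        }
        if (bestHost == null) return null;               // all hosts are "cooling off"
        nextAllowedFetch.put(bestHost, now + SERVER_DELAY_MS);
        String url = queuesByHost.get(bestHost).poll();
        if (queuesByHost.get(bestHost).isEmpty()) queuesByHost.remove(bestHost);
        return url;
    }
}
```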
Note that OC's fetchlist is an interface, which means that Nutch's offline fetchlist building can be replicated by:
1. Seed crawl from file (same as Nutch)
2. When crawl finishes, run Nutch fetchlist building steps
3. Implement a simple fetchlist which reads entries from a Nutch fetchlist (.fl)
4. Repeat steps 2 and 3.
Actually, the current codebase already contains the necessary classes to do this; a rough sketch of the idea follows.
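For illustration, a minimal sketch of what such a fetchlist interface and an offline-fed implementation could look like. The names (Fetchlist, OfflineFetchlistAdapter) are hypothetical, not the actual classes in the codebase, and for simplicity the sketch reads one URL per line rather than Nutch's binary .fl format:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical fetchlist abstraction: the crawler just asks for the next URL. */
interface Fetchlist {
    /** Returns the next URL to fetch, or null when the list is exhausted. */
    String next() throws IOException;
}

/** Hedged sketch of an adapter that feeds the crawler from a pre-built,
 *  offline fetchlist, read here as one URL per line. A real adapter would
 *  read Nutch's .fl format through Nutch's own readers. */
class OfflineFetchlistAdapter implements Fetchlist {
    private final Deque<String> urls = new ArrayDeque<>();

    OfflineFetchlistAdapter(Reader source) throws IOException {
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (!line.trim().isEmpty()) urls.add(line.trim());
            }
        }
    }

    @Override
    public String next() {
        return urls.poll();
    }
}
```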
- Java needs a library of external memory algorithms and data structures (I'm working on this now). Examples from Nutch are SequenceFile, MapFile and ArrayFile, as well as the external mergesort algorithm. Other examples are B+-tree, insertion sort for nearly sorted files, buffer tree, heap file, etc. The list goes on.
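As a taste of one such building block, here is a compact external mergesort over the lines of a large text file: a rough sketch under the assumption that individual runs fit in memory, not production code.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

/** Hedged sketch of an external mergesort over the lines of a large text file:
 *  split the input into sorted runs that fit in memory, then k-way merge the
 *  runs with a priority queue. Error handling and tuning are omitted. */
public class ExternalMergeSort {

    private static final int LINES_PER_RUN = 100_000;   // rough in-memory limit

    public static void sort(Path input, Path output) throws IOException {
        List<Path> runs = writeSortedRuns(input);
        mergeRuns(runs, output);
        for (Path run : runs) Files.deleteIfExists(run);
    }

    /** Phase 1: read the input in chunks, sort each chunk, write it to a temp file. */
    private static List<Path> writeSortedRuns(Path input) throws IOException {
        List<Path> runs = new ArrayList<>();
        try (BufferedReader in = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
            List<String> buffer = new ArrayList<>(LINES_PER_RUN);
            String line;
            while ((line = in.readLine()) != null) {
                buffer.add(line);
                if (buffer.size() >= LINES_PER_RUN) runs.add(flushRun(buffer));
            }
            if (!buffer.isEmpty()) runs.add(flushRun(buffer));
        }
        return runs;
    }

    private static Path flushRun(List<String> buffer) throws IOException {
        Collections.sort(buffer);
        Path run = Files.createTempFile("run-", ".txt");
        Files.write(run, buffer, StandardCharsets.UTF_8);
        buffer.clear();
        return run;
    }

    /** Phase 2: k-way merge. Each heap entry is (current line, reader it came from). */
    private static void mergeRuns(List<Path> runs, Path output) throws IOException {
        Comparator<Map.Entry<String, BufferedReader>> byLine = Map.Entry.comparingByKey();
        PriorityQueue<Map.Entry<String, BufferedReader>> heap = new PriorityQueue<>(byLine);
        for (Path run : runs) {
            BufferedReader reader = Files.newBufferedReader(run, StandardCharsets.UTF_8);
            String first = reader.readLine();
            if (first != null) heap.add(new AbstractMap.SimpleEntry<>(first, reader));
            else reader.close();
        }
        try (BufferedWriter out = Files.newBufferedWriter(output, StandardCharsets.UTF_8)) {
            while (!heap.isEmpty()) {
                Map.Entry<String, BufferedReader> smallest = heap.poll();
                out.write(smallest.getKey());
                out.newLine();
                String next = smallest.getValue().readLine();
                if (next != null) heap.add(new AbstractMap.SimpleEntry<>(next, smallest.getValue()));
                else smallest.getValue().close();
            }
        }
    }
}
```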
DNS Resolution in Java
Posted by Kelvin on 06 Aug 2005 | Tagged as: programming
As alluded to in several other papers on crawling, DNS resolution in Java performs poorly and is usually a bottleneck.
http://www.buzzsurf.com/java/dns/ and http://www.xbill.org/dnsjava/ seem to be viable alternatives.
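A minimal sketch of an A-record lookup with dnsjava (org.xbill.DNS); treat the exact class and method names as my recollection of the library's API rather than gospel:

```java
import org.xbill.DNS.ARecord;
import org.xbill.DNS.Lookup;
import org.xbill.DNS.Record;
import org.xbill.DNS.Type;

public class DnsJavaLookupSketch {
    public static void main(String[] args) throws Exception {
        // Resolve the A records for a host without going through InetAddress.
        Record[] records = new Lookup("www.example.com", Type.A).run();
        if (records == null) {
            System.out.println("lookup failed");
            return;
        }
        for (Record record : records) {
            System.out.println(((ARecord) record).getAddress().getHostAddress());
        }
    }
}
```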
Basis – Infrastructure for Java projects
Posted by Kelvin on 04 Aug 2005 | Tagged as: work, programming
Basis is an open-source project providing commonly-required components such as user management, login/authentication, role-based access management/authorization and Hibernate support classes.
The data models are derived from The Data Model Resource Book by Len Silverston and adapted for Hibernate.
Unlike other open-source projects, Basis was never meant to be used as-is, but rather as a starting point for your individual project needs.
Current version : 0.1-alpha
Notes: 0.1-alpha is complete, but not release-quality. Still, do take it for a trial run.
Downloads
Download the latest version basis-0.1-alpha.zip
View javadoc documentation
License
Basis is released under the BSD license (includes software from the Apache foundation).
Credits
Big thanks to Gesakon GmbH for sponsoring development of Basis, and generously allowing it to be open-sourced.
TODO:
There is quite a bit of code providing a web-based UI using WebWork/Velocity that is not yet ready for release.
Test cases exist, but they're pretty outdated!
Improving Nutch for constrained crawls
Posted by Kelvin on 18 Jul 2005 | Tagged as: Lucene / Solr / Elasticsearch / Nutch, work, programming
A heads-up to anyone who is using Nutch for anything but whole-web crawling..
I'm about 70% through patching Nutch to be more usable for constrained crawling. It is basically a partial re-write of the whole fetching mechanism, borrowing large chunks of code here and there.
Features include:
– Customizable seed inputs, i.e. seed a crawl from a file, database, Nutch FetchList, etc
– Customizable crawl scopes, e.g. crawl the seed URLs and only the URLs within their domains (this can already be accomplished manually with RegexURLFilter, but what if there are 200,000 seed URLs?), or crawl seed URL domains + 1 external link (not possible with the current filter mechanism)
– Online fetchlist building (as opposed to Nutch's offline method), and customizable strategies for building a fetchlist. The default implementation gives priority to hosts with a larger number of pages to crawl. Note that offline fetchlist building is ok too.
– Customizable fetch output mechanisms, like output to file, to segment, or even not at all (if we're just implementing a link-checker, for example)
– Fully utilizes HTTP 1.1 features (see below)
My aim is to be fully compatible with Nutch as it is, i.e. given a Nutch fetchlist, the new crawler can produce a Nutch segment. However, if you don't need that at all, and are just interested in Nutch as a crawler, then that's ok too!
Nutch and HTTP 1.1
Nutch, as it stands, does not utilize HTTP 1.1's connection persistence and request pipelining, features which would significantly cut down crawling time when crawling a small number of hosts extensively.
To test this hypothesis, I modified Http and HttpResponse to use HTTP 1.1 features and fetched 20 URLs from the same host repeatedly (15 requests were pipelined per socket connection, which required a total of 2 separate connections to the server). I repeated this test with Nutch's Http (protocol-http). After a couple of runs, I was convinced that, at least in my test case, the optimized Http class was 3-4x faster than Nutch's version.
I believe the reasons for this speed-up are:
1. Reduced connection overhead (2 socket connections vs 20)
2. Reduced wait-time between requests (once vs 19 times). This is the fetcher.server.delay value in NutchConf.
3. Request pipelining (not waiting for a request to produce a response before sending the next request)
Note that this is a contrived example, of course, but not an improbable one when crawling large domains with a large number of fetcher threads.
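To make the idea concrete, here is a stripped-down sketch of pipelining GET requests over a single persistent socket with plain java.net. It assumes every response carries a Content-Length header (no chunked encoding) and does none of the error handling a real fetcher needs; the host and paths are placeholders:

```java
import java.io.BufferedInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

/** Hedged sketch of HTTP 1.1 request pipelining over one persistent socket.
 *  Assumes every response carries a Content-Length header (no chunked
 *  transfer encoding) and skips all real-world error handling. */
public class PipeliningSketch {

    public static void main(String[] args) throws IOException {
        String host = "example.com";                       // placeholder host
        String[] paths = { "/a.html", "/b.html", "/c.html" };

        try (Socket socket = new Socket(host, 80)) {
            OutputStream out = socket.getOutputStream();
            // Send all requests back-to-back without waiting for responses.
            for (String path : paths) {
                String request = "GET " + path + " HTTP/1.1\r\n"
                               + "Host: " + host + "\r\n"
                               + "Connection: keep-alive\r\n\r\n";
                out.write(request.getBytes(StandardCharsets.US_ASCII));
            }
            out.flush();

            // Read the responses in the same order the requests were sent.
            DataInputStream in = new DataInputStream(
                    new BufferedInputStream(socket.getInputStream()));
            for (String path : paths) {
                int contentLength = 0;
                String line;
                while ((line = readLine(in)).length() > 0) {   // headers end at a blank line
                    if (line.toLowerCase().startsWith("content-length:")) {
                        contentLength = Integer.parseInt(line.substring(15).trim());
                    }
                }
                byte[] body = new byte[contentLength];
                in.readFully(body);                            // consume the body
                System.out.println(path + ": " + contentLength + " bytes");
            }
        }
    }

    /** Reads one CRLF-terminated header line (good enough for this sketch). */
    private static String readLine(InputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = in.read()) != -1 && c != '\n') {
            if (c != '\r') sb.append((char) c);
        }
        return sb.toString();
    }
}
```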
I've still got a pretty long way to go in terms of testing, but I think a good portion of the difficult bits have already been ironed out. When I'm satisfied that it's stable, I'll release a version for download and feedback.
16 Aug 2005 update
You might be interested in Reflections on modifying the Nutch crawler
Useful links on HTTP Pipelining
Posted by Kelvin on 12 Jul 2005 | Tagged as: programming, Lucene / Solr / Elasticsearch / Nutch
Am implementing HTTP Pipelining for Nutch. Some useful links:
http://www.oaklandsoftware.com/product_16compare.html
http://www.mozilla.org/projects/netlib/http/pipelining-faq.html
http://www.w3.org/Protocols/HTTP/Performance/Pipeline.html
http://www.jmarshall.com/easy/http/
Update
Some thoughts working on HTTP pipelining:
It's a pity Commons HttpClient doesn't have HTTP request pipelining implemented. I tried my hand at it, but gave up after 2 hours. In some ways, Commons HttpClient seems a bit of an overkill, with all its glorious complexity compared to Nutch's Http and HttpResponse classes. OTOH, there's proper cookie and authentication support, amongst a host of other features. Innovation.ch's HttpClient provides request pipelining and most of Commons HttpClient's features, but it isn't actively developed anymore, and I'm a bit hesitant about using it. Nonetheless, I have to say its code is very well documented!
I spent about 3 hours knocking my head against the wall on a bug before I realized that mixing HTTP 1.0 and 1.1 servers is quite unhealthy when attempting request pipelining. I also had to remember that HTTP 1.0 doesn't support Keep-Alive (aka connection persistence) by default. Furthermore, sending the header Connection: Keep-Alive would cause some HTTP 1.0 proxies to fail unpredictably. I decided to steer well clear of the mess and simply _not_ pipeline or reuse HTTP 1.0 sockets.
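A tiny sketch of that "play it safe" rule: only reuse or pipeline a connection when the server clearly speaks HTTP 1.1 and hasn't asked for the connection to be closed. The class and method names are mine, and header names are assumed to be lower-cased by the caller:

```java
import java.util.Locale;
import java.util.Map;

/** Minimal sketch of the "play it safe" rule above: only reuse or pipeline a
 *  connection when the server clearly speaks HTTP 1.1 and has not asked for
 *  the connection to be closed. Header names are assumed lower-cased. */
public class ConnectionReusePolicy {

    public static boolean mayPipeline(String statusLine, Map<String, String> headers) {
        boolean http11 = statusLine.startsWith("HTTP/1.1");
        String connection = headers.getOrDefault("connection", "")
                                   .toLowerCase(Locale.ROOT);
        // Treat HTTP 1.0 (and anything ambiguous) as one request per socket;
        // an explicit "Connection: close" also rules out reuse.
        return http11 && !connection.contains("close");
    }

    public static void main(String[] args) {
        System.out.println(mayPipeline("HTTP/1.1 200 OK", Map.of()));                            // true
        System.out.println(mayPipeline("HTTP/1.0 200 OK", Map.of("connection", "keep-alive")));  // false
    }
}
```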
Using Nutch for constrained crawling
Posted by Kelvin on 08 Jul 2005 | Tagged as: programming, Lucene / Solr / Elasticsearch / Nutch
My experience with Nutch so far is that it's not ideal for crawling, say, 50,000 specific sites, as opposed to whole-web crawling.
Here's why I think so:
- No built-in interfaces for multiple seed URLs. The only ways to input seed URLs are via a text file or a downloaded DMOZ dump. Extra steps have to be taken to, say, dump a database table to text, rather than having a direct link between Nutch and the database. Of course, this is a very minor issue.
- Not possible (without major hacking) to implement crawl scopes, such as "crawl this entire domain" or "crawl this entire domain plus one link outside of it" (see the sketch after this list).
- Needs patching to change the definition of a "host". This is important because some sites with different subdomains are really served from the same server. For example, x.foo.com and y.foo.com are considered different hosts in Nutch, but they may well be served from the same server; by making Nutch group them by their common parent domain, crawling will be more polite to that server.
- No easy way to change the fetchlist-building strategy, which is currently optimized for whole-web crawling. It assumes a large number of distinct hosts, so that a randomized spread of URLs is sufficient to be polite to servers (i.e., not hammer them). This, however, is not the case with site-specific crawls (especially crawls where pages outside the seed domains are not fetched). As far as I can tell, fetching also does not take advantage of HTTP 1.1's ability to download multiple pages per connection.
- Needs patching to support custom fetch output (or none at all). For example, an application that simply wants to check whether pages are 404ed would implement a no-op fetch output.
- Needs patching to support post-fetch (non-output) operations. For example, if a fetched page contains a certain keyword, add an entry to a database table.
- No support for URL-pattern or domain-specific operations, such as: if a URL contains "?", don't parse the page for outlinks.
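As promised above, here is a rough sketch of the crawl-scope idea: restrict the crawl to the registered domains of the seed URLs. The class name and the naive domain extraction are my own simplifications, not Nutch code:

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hedged sketch of a "stay within the seed domains" crawl scope.
 *  The domain extraction below is naive (it just keeps the last two labels,
 *  so x.foo.com -> foo.com) and would need refinement for ccTLDs like .co.uk. */
public class SeedDomainScope {

    private final Set<String> seedDomains = new HashSet<>();

    public SeedDomainScope(List<String> seedUrls) throws MalformedURLException {
        for (String seed : seedUrls) {
            seedDomains.add(registeredDomain(new URL(seed).getHost()));
        }
    }

    /** Returns true if the URL falls inside one of the seed domains. */
    public boolean inScope(String url) {
        try {
            return seedDomains.contains(registeredDomain(new URL(url).getHost()));
        } catch (MalformedURLException e) {
            return false;
        }
    }

    /** x.foo.com and y.foo.com both map to foo.com, so they are treated as
     *  the same "host" for scoping (and could be for politeness, too). */
    private static String registeredDomain(String host) {
        String[] labels = host.toLowerCase().split("\\.");
        if (labels.length <= 2) return host.toLowerCase();
        return labels[labels.length - 2] + "." + labels[labels.length - 1];
    }
}
```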
Well, here's a start at least. I'm not whining, I'm just listing down some ideas for possible improvements.
Liberating your Amazon Wishlist
Posted by Kelvin on 01 Jul 2005 | Tagged as: programming
Amazon's wishlist is a great feature for managing a list of books I wanna get, and how much I should pay for them if I head to a bookstore. This, coupled with Amazon's Web Services API, makes for interesting possibilities.
How to find your Amazon wishlist id:
- View your wishlist in your browser, then click on the "Your Wish Lists" link right at the top of the page.
- A page will be displayed with your existing lists, plus a form titled "Create a new Wish List".
- Right-click on the wishlist link and copy it to clipboard.
- Paste into a text editor like Notepad, and the part of the link after "id=…." is the wishlist id. It should look something like 1YY1OKRQG2EU5.
Credits go to Kevin Donahue at http://s92507495.onlinehome.us/mt/mt-tb.cgi/885 for giving me the idea..
DHTML toolkits
Posted by Kelvin on 18 Jan 2005 | Tagged as: life, programming
Link log of DHTML toolkits. Most promising seems to be nwidgets/netwindows. Articles about it can be found here and here.
http://koranteng.blogspot.com/2004/07/on-rich-web-applications-alphablox-and.html is an interesting read about building advanced web interfaces, http://scottandrew.com/weblog/dhtmllibs has an annotated list of DHTML libs, and http://www.dithered.com/javascript/index.html has some interesting libs.
Remote scripting magic
Posted by Kelvin on 06 Jan 2005 | Tagged as: programming
18 Jan 2005 update: XmlHttpRequest seems to be a "better" way of doing things than remote scripting. http://jpspan.sourceforge.net/wiki/doku.php?id=javascript:xmlhttprequest and http://jpspan.sourceforge.net/wiki/doku.php?id=javascript:remotescripting enumerate the differences, as well as provide a comprehensive list of links.
http://www.peej.co.uk/articles/rest.html also advocates XmlHttpRequest over RPC-style calls. http://www.peej.co.uk/articles/rich-user-experience.html has more about this.
http://www.sitepoint.com/blog-post-view.php?id=171725 contains some insights on using XMLHttpRequest with Javascript closures.
Linklog of remote scripting resources:
Remote Scripting with Javascript, IFrames and PHP
http://www.phpbuilder.com/columns/vance20040507.php3
Remote scripting with javascript
http://www.dotvoid.com/view.php?id=13 and http://www.dotvoid.com/view.php?id=28
Cool LiveSearch feature
http://blog4.bitflux.ch/wiki/LiveSearch
Serverside autocompletion
http://www.sitepoint.com/blog-post-view.php?id=180147
Separating browser from resource
http://www.sitepoint.com/blog-post-view.php?id=176885
XMLHttpRequest
http://jibbering.com/2002/4/httprequest.html
Atom
http://www.atomenabled.org/
Javascript AtomAPI Client using XmlHttpRequest
http://www.isolani.co.uk/blog/atom/JavascriptAtomApiClientUsingXmlHttpRequest
jpspan, hooking up PHP and javascript
http://jpspan.sourceforge.net/wiki/doku.php
Qwad framework
http://www.qwad.com.au/code/doku.php?id=qwad_framework
JSRS
http://www.ashleyit.com/rs/main.htm
Somewhat dated article on DW about client-server javascript comms
http://www-106.ibm.com/developerworks/web/library/wa-exrel/?dwzone=web
JavaScript O Lait, providing javascript XML/RPC impl
http://jsolait.net/
Crouching javascript
http://www.sitepoint.com/blog-post-view.php?id=191776 and http://www.sitepoint.com/blog-post-view.php?id=192138
Apple Developer Connection about remote scripting
http://developer.apple.com/internet/webcontent/iframe.html
Pushlets
http://www.pushlets.com/
ECMAScript library acting as a cross-browser wrapper for native XML APIs
http://sarissa.sourceforge.net/doc/
Side-note: PHP developers are right out on the cutting edge of Javascript and DOM manipulation, getting standard browsers to do amazing stuff. Where are the Java folks?
Designing Usable APIs
Posted by Kelvin on 05 Jan 2005 | Tagged as: programming
Carlos Perez writes at http://www.manageability.org/blog/archive/20030514%23human_factors_design_of_apis and http://www.manageability.org/blog/stuff/even-more-wisdom-designing-usable-apis/view about designing usable APIs.