Our Crawler Todo List
Posted by Kelvin on 25 Aug 2005 at 07:47 pm | Tagged as: crawling, programming, Lucene / Solr / Elasticsearch / Nutch
In the order in which these are jumping off my brain (read: no order whatsoever):
-
If-modified-since
OC already has the basic infrastructure in place for conditional downloading of pages based on the If-Modified-Since HTTP header. It just needs to be implemented in the Http and HttpResponse classes.
-
Re-use socket connections even for redirects
Right now, redirects always open a new socket connection, even if the redirected URL is on the same host as the original. There's no reason why this should be so.
-
URL duplication is only checked against a fetcherthread's unfetched URLs and its already-fetched URLs, but not against the URLs currently being fetched. This allows duplicates to sneak in.
-
Customize per-host crawling params
If host parameters such as HTTP version, response time and number of pages can be captured and stored, the crawler might be able to modify its interaction with the server accordingly. For example, it could pipeline fewer requests to slower hosts.
-
Use Commons HttpClient as alternative HTTP implementation
Although HttpClient doesn't support pipelining, it does provide connection persistence by default. It would be interesting to compare speed differences between this and the pipelined version. Furthermore, HttpClient is much more robust in handling the various nuances webservers tend to throw our way (Andrzej has documented these in the Nutch Jira).
-
Bot-trap detection
As mentioned in Limitations of OC, bot-traps can be fatal for OC.
-
Checkpointing
On long crawls, schedule a checkpoint every x minutes so that the crawler can be easily resumed from the last checkpoint in the case of a hang. I don't think this will be difficult to implement, if a bit routine (read: boring).
-
Runtime management (via JMX?)
The Heritrix project has taken the JMX route in incorporating runtime management into their crawler. I don't know enough about JMX to decide if this is a good idea, but I have heard that JMX is icky. I suppose it's either JMX or client-server? These two seem to be the only options I have if I want to build a GUI for OC.
-
Implement ExtendibleHashFile and a fixed-length optimized version of SequenceFile/MapFile.
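For the If-modified-since item, here is a rough sketch of what the request side might look like. ConditionalGet and both of its methods are made-up names for illustration; this is not code from OC's Http or HttpResponse classes, and it assumes the crawler stores the time of the last successful fetch for each page.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

/** Hypothetical helper, not part of OC. */
final class ConditionalGet {
    // HTTP dates use a fixed-width format and are always expressed in GMT.
    private static final DateTimeFormatter HTTP_DATE =
        DateTimeFormatter.ofPattern("EEE, dd MMM yyyy HH:mm:ss 'GMT'", Locale.US)
                         .withZone(ZoneOffset.UTC);

    /** Returns the header line to add to the request, or "" if the page was never fetched. */
    static String ifModifiedSinceHeader(Instant lastFetched) {
        if (lastFetched == null) {
            return "";
        }
        return "If-Modified-Since: " + HTTP_DATE.format(lastFetched) + "\r\n";
    }

    /** A 304 Not Modified response carries no body; the stored copy is still current. */
    static boolean shouldReuseStoredCopy(int statusCode) {
        return statusCode == 304;
    }
}
```

On the response side, a 304 would skip parsing entirely and leave the previously stored page in place.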
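For the URL duplication item, one possible fix is to have a URL "claimed" the moment it is queued, so that unfetched, in-flight and already-fetched URLs are all covered by the same check. A minimal sketch, assuming a single set shared by all fetcher threads (SeenUrls and tryClaim are hypothetical names, and this is not how OC is structured today):

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical shared registry of every URL that has been queued, is being fetched, or was fetched. */
final class SeenUrls {
    // Thread-safe set backed by a ConcurrentHashMap.
    private final Set<String> seen = ConcurrentHashMap.newKeySet();

    /**
     * Atomically claims a URL. Returns true for the first caller only,
     * so two threads can never both decide to fetch the same URL.
     */
    boolean tryClaim(String url) {
        return seen.add(url);
    }
}
```

This only catches exact string matches. URLs would still need to be normalized before being claimed, and a very large crawl would not fit the whole set in memory.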
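For the per-host crawling params item, the "pipeline fewer requests to slower hosts" idea could start out as simple as the sketch below. HostProfiles, the smoothing factor and the thresholds are all assumptions picked for illustration; real values would have to come from measurements.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical per-host statistics, not part of OC. */
final class HostProfiles {
    // Smoothed response time per host, in milliseconds.
    private final Map<String, Double> avgMillis = new ConcurrentHashMap<>();

    /** Folds a new measurement into an exponential moving average for the host. */
    void recordResponse(String host, long millis) {
        avgMillis.merge(host, (double) millis, (old, cur) -> 0.8 * old + 0.2 * cur);
    }

    /** Picks how many requests to pipeline; thresholds are arbitrary placeholders. */
    int pipelineDepth(String host) {
        Double avg = avgMillis.get(host);
        if (avg == null) {
            return 4;   // nothing known about this host yet
        }
        if (avg > 2000) {
            return 1;   // very slow host: one request at a time
        }
        if (avg > 500) {
            return 2;
        }
        return 8;
    }
}
```

HTTP version and page counts could be stored the same way, keyed by host.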