A simple API-friendly crawler
Posted by Kelvin on 01 Dec 2006 at 01:40 am | Tagged as: Lucene / Solr / Elasticsearch / Nutch, crawling, programming
Alright. I know I've blogged about this before. Well, I'm revisiting it again.
My sense is that there's a real need for a simple crawler which is easy to use as an API and doesn't attempt to be everything to everyone.
Yes, Nutch is cool, but I'm so tired of fiddling around with configuration files, the proprietary file formats, and the filesystem-dependence of plugins. Also, crawl progress reporting is poor unless you're prepared to parse log files.
Here are some thoughts on what a simple crawler might look like:
Download all pages in a site
SimpleCrawler c = new SimpleCrawler();
c.addURL(url);
c.setOutput(new SaveToDisk(downloaddir));
c.setProgressListener(new StdOutProgressListener());
c.setScope(new HostScope(url));
c.start();
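None of these classes exist yet, so here is one guess at the minimal interfaces the example above implies. All names (Scope, LinkExtractor, Output, ProgressListener, HostScope) are hypothetical, chosen only to mirror the calls shown:

```java
import java.net.URI;
import java.util.List;

// Hypothetical pluggable hooks implied by the API sketch above.
interface Scope {
    boolean inScope(URI url);                 // should this URL be crawled?
}

interface LinkExtractor {
    List<URI> extract(String html, URI base); // outlinks found in a fetched page
}

interface Output {
    void store(URI url, byte[] content);      // e.g. SaveToDisk, NutchSegmentOutput
}

interface ProgressListener {
    void fetched(URI url, int statusCode);    // e.g. StdOutProgressListener
}

// Restrict the crawl to the host of the seed URL, as setScope(new HostScope(url))
// suggests.
class HostScope implements Scope {
    private final String host;

    HostScope(URI seed) {
        this.host = seed.getHost();
    }

    public boolean inScope(URI url) {
        return host != null && host.equalsIgnoreCase(url.getHost());
    }
}
```

The point of keeping these as small single-method interfaces is that each crawl scenario below is just a different combination of implementations.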
Download all urls from a file (depth 1 crawl)
SimpleCrawler c = new SimpleCrawler();
c.setMaxConnectionsPerHost(5);
c.setIntervalBetweenConsecutiveRequests(1000);
c.addURLs(new File(file));
c.setLinkExtractor(null);
c.setOutput(new DirectoryPerDomain(downloaddir));
c.setProgressListener(new StdOutProgressListener());
c.start();
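The setIntervalBetweenConsecutiveRequests(1000) call above implies some per-host politeness throttle. A minimal sketch of what that bookkeeping might look like (the HostThrottle class is hypothetical):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical per-host throttle: tracks when each host was last requested and
// tells the caller how long to wait before the next request to that host.
class HostThrottle {
    private final long intervalMillis;
    private final Map<String, Long> lastRequest = new HashMap<>();

    HostThrottle(long intervalMillis) {
        this.intervalMillis = intervalMillis;
    }

    // Returns the delay (ms) the caller should sleep before fetching from host.
    synchronized long delayFor(String host, long now) {
        Long last = lastRequest.get(host);
        long wait = (last == null) ? 0 : Math.max(0, last + intervalMillis - now);
        lastRequest.put(host, now + wait); // when the request will actually fire
        return wait;
    }
}
```

Requests to different hosts don't block each other, which is also why a separate setMaxConnectionsPerHost(5) knob makes sense alongside it.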
Page through a search results page via regex
SimpleCrawler c = new SimpleCrawler();
c.addURL(url);
c.setLinkExtractor(new RegexLinkExtractor(regex));
c.setOutput(new SaveToDisk(downloaddir));
c.setProgressListener(new StdOutProgressListener());
c.start();
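For the paging scenario, the RegexLinkExtractor would follow only URLs captured by group 1 of the supplied pattern, e.g. a "next page" link in a search results page. A sketch, assuming that capture-group convention:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical RegexLinkExtractor: the crawler follows whatever group(1) of the
// supplied regex matches in the fetched page, ignoring all other links.
class RegexLinkExtractor {
    private final Pattern pattern;

    RegexLinkExtractor(String regex) {
        this.pattern = Pattern.compile(regex);
    }

    List<String> extract(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = pattern.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }
}
```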
Save to nutch segment for compatibility
SimpleCrawler c = new SimpleCrawler();
c.addURL(url);
c.setOutput(new NutchSegmentOutput(segmentdir));
c.setProgressListener(new StdOutProgressListener());
c.start();
I'm basically trying to find the sweet spot between Commons HttpClient and a full-blown crawler app like Nutch.
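Stripped of fetching and I/O, the loop underneath all four scenarios is the same: a frontier queue, a visited set, and the pluggable hooks deciding what gets followed and stored. A sketch of that core, with the link graph stubbed out as a map so it stays self-contained (fetch and store calls would slot in where the comment marks):

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Minimal breadth-first crawl loop sketch. The links map stands in for
// fetching a page and running a LinkExtractor over it.
class CrawlLoop {
    static List<String> crawl(String seed, Map<String, List<String>> links, int maxPages) {
        Deque<String> frontier = new ArrayDeque<>();
        Set<String> seen = new LinkedHashSet<>();
        frontier.add(seed);
        while (!frontier.isEmpty() && seen.size() < maxPages) {
            String url = frontier.poll();
            if (!seen.add(url)) {
                continue;        // already crawled
            }
            // fetch(url), output.store(...), and listener.fetched(...) go here
            for (String out : links.getOrDefault(url, Collections.emptyList())) {
                if (!seen.contains(out)) {
                    frontier.add(out);
                }
            }
        }
        return new ArrayList<>(seen);
    }
}
```

Setting the link extractor to null (the depth-1 file crawl above) just means the frontier never grows past the seeds.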
Thoughts?
4 Responses to “A simple API-friendly crawler”
Hi Kelvin,
I completely agree with you that there's a need for an API like you describe and also feel the pain you describe in working with Nutch. The problem IMHO is that something that's "easy to use" is in a way exactly the same as "everything to everyone". To create an easy API you'll often need a pretty powerful crawler. A request may be put simply, but performing it is a completely different matter.
I think most of the ideas you describe can be implemented in Nutch. Some functions are just wrappers for configuration changes, but a couple of things might require moving functionality that's now only available from the command line into a framework through which the different stages of the Nutch crawler can communicate. The way to communicate might not be trivial, although I'm not confident enough of my Java skills to make a solid judgement on that. Such a framework might also be the place where progress listeners can be implemented.
–Eelco
Hi Kelvin,
I posted a comment to this post on your 'old' site. Did you remove it intentionally or by accident?
Never mind 😉 Because I posted I can see it's in moderation…
Very sorry about the comment snafu. I only *JUST* realized what a huge comment moderation queue I have (99.999% of which is spam). Just spent the last 2 hours cleaning things up!