A simple API-friendly crawler
Posted by Kelvin on 01 Dec 2006 at 01:40 am | Tagged as: Lucene / Solr / Elasticsearch / Nutch, crawling, programming
Alright. I know I've blogged about this before. Well, I'm revisiting it again.
My sense is that there's a real need for a simple crawler which is easy to use as an API and doesn't attempt to be everything to everyone.
Yes, Nutch is cool, but I'm so tired of fiddling around with configuration files, the proprietary file formats, and the filesystem-dependence of plugins. Also, crawl progress reporting is poor unless you're prepared to parse log files.
Here are some thoughts on what a simple crawler might look like:
Download all pages in a site
SimpleCrawler c = new SimpleCrawler();
c.addURL(url);
c.setOutput(new SaveToDisk(downloaddir));
c.setProgressListener(new StdOutProgressListener());
c.setScope(new HostScope(url));
c.start();
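None of these classes exist yet, so here is one guess at the minimal interfaces the example above implies. All names (Scope, LinkExtractor, Output, ProgressListener, HostScope) are hypothetical, chosen only to mirror the calls shown:

```java
import java.net.URI;
import java.util.List;

// Hypothetical pluggable hooks implied by the API sketch above.
interface Scope {
    boolean inScope(URI url);                 // should this URL be crawled?
}

interface LinkExtractor {
    List<URI> extract(String html, URI base); // outlinks found in a fetched page
}

interface Output {
    void store(URI url, byte[] content);      // e.g. SaveToDisk, NutchSegmentOutput
}

interface ProgressListener {
    void fetched(URI url, int statusCode);    // e.g. StdOutProgressListener
}

// Restrict the crawl to the host of the seed URL, as setScope(new HostScope(url))
// suggests.
class HostScope implements Scope {
    private final String host;

    HostScope(URI seed) {
        this.host = seed.getHost();
    }

    public boolean inScope(URI url) {
        return host != null && host.equalsIgnoreCase(url.getHost());
    }
}
```

The point of keeping these as small single-method interfaces is that each crawl scenario below is just a different combination of implementations.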
Download all urls from a file (depth 1 crawl)
SimpleCrawler c = new SimpleCrawler();
c.setMaxConnectionsPerHost(5);
c.setIntervalBetweenConsecutiveRequests(1000);
c.addURLs(new File(file));
c.setLinkExtractor(null);
c.setOutput(new DirectoryPerDomain(downloaddir));
c.setProgressListener(new StdOutProgressListener());
c.start();
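The setIntervalBetweenConsecutiveRequests(1000) call above implies some per-host politeness throttle. A minimal sketch of what that bookkeeping might look like (the HostThrottle class is hypothetical):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical per-host throttle: tracks when each host was last requested and
// tells the caller how long to wait before the next request to that host.
class HostThrottle {
    private final long intervalMillis;
    private final Map<String, Long> lastRequest = new HashMap<>();

    HostThrottle(long intervalMillis) {
        this.intervalMillis = intervalMillis;
    }

    // Returns the delay (ms) the caller should sleep before fetching from host.
    synchronized long delayFor(String host, long now) {
        Long last = lastRequest.get(host);
        long wait = (last == null) ? 0 : Math.max(0, last + intervalMillis - now);
        lastRequest.put(host, now + wait); // when the request will actually fire
        return wait;
    }
}
```

Requests to different hosts don't block each other, which is also why a separate setMaxConnectionsPerHost(5) knob makes sense alongside it.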
Page through a search results page via regex
SimpleCrawler c = new SimpleCrawler();
c.addURL(url);
c.setLinkExtractor(new RegexLinkExtractor(regex));
c.setOutput(new SaveToDisk(downloaddir));
c.setProgressListener(new StdOutProgressListener());
c.start();
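For the paging scenario, the RegexLinkExtractor would follow only URLs captured by group 1 of the supplied pattern, e.g. a "next page" link in a search results page. A sketch, assuming that capture-group convention:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical RegexLinkExtractor: the crawler follows whatever group(1) of the
// supplied regex matches in the fetched page, ignoring all other links.
class RegexLinkExtractor {
    private final Pattern pattern;

    RegexLinkExtractor(String regex) {
        this.pattern = Pattern.compile(regex);
    }

    List<String> extract(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = pattern.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }
}
```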
Save to nutch segment for compatibility
SimpleCrawler c = new SimpleCrawler();
c.addURL(url);
c.setOutput(new NutchSegmentOutput(segmentdir));
c.setProgressListener(new StdOutProgressListener());
c.start();
I'm basically trying to find the sweet spot between Commons HttpClient and a full-blown crawler app like Nutch.
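Stripped of fetching and I/O, the loop underneath all four scenarios is the same: a frontier queue, a visited set, and the pluggable hooks deciding what gets followed and stored. A sketch of that core, with the link graph stubbed out as a map so it stays self-contained (fetch and store calls would slot in where the comment marks):

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Minimal breadth-first crawl loop sketch. The links map stands in for
// fetching a page and running a LinkExtractor over it.
class CrawlLoop {
    static List<String> crawl(String seed, Map<String, List<String>> links, int maxPages) {
        Deque<String> frontier = new ArrayDeque<>();
        Set<String> seen = new LinkedHashSet<>();
        frontier.add(seed);
        while (!frontier.isEmpty() && seen.size() < maxPages) {
            String url = frontier.poll();
            if (!seen.add(url)) {
                continue;        // already crawled
            }
            // fetch(url), output.store(...), and listener.fetched(...) go here
            for (String out : links.getOrDefault(url, Collections.emptyList())) {
                if (!seen.contains(out)) {
                    frontier.add(out);
                }
            }
        }
        return new ArrayList<>(seen);
    }
}
```

Setting the link extractor to null (the depth-1 file crawl above) just means the frontier never grows past the seeds.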
Thoughts?
4 Responses to “A simple API-friendly crawler”
Hi Kelvin,
I completely agree with you that there's a need for an API like you describe and also feel the pain you describe in working with Nutch. The problem IMHO is that something that's "easy to use" is in a way exactly the same as "everything to everyone". To create an easy API you'll often need a pretty powerful crawler. A request may be put simply, but performing it is a completely different matter.
I think most of the ideas you describe can be implemented in Nutch. Some functions are just wrappers for configuration changes, but a couple of things might require moving functionality that's now only available from the command line into a framework through which the different stages of the Nutch crawler can communicate. The way to communicate might not be trivial, although I'm not confident enough of my Java skills to make a solid judgement on that. Such a framework might also be the place where progress listeners can be implemented.
–Eelco
Hi Kelvin,
I posted a comment to this post on your 'old' site. Did you remove it intentionally or by accident?
Never mind 😉 Because I posted I can see it's in moderation…
Very sorry about the comment snafu. I only *JUST* realized what a huge comment moderation queue I have (99.999% of which is spam). Just spent the last 2 hours cleaning things up!