Supermind Search Consulting Blog 
Solr - Elasticsearch - Big Data

Posts about crawling

Properly unit testing scrapy spiders

Posted by Kelvin on 20 Nov 2014 | Tagged as: crawling, Python

Because Scrapy is built on Twisted, writing self-contained unit tests for spiders easily and efficiently runs into a host of obstacles:

1. You can't call reactor.run() multiple times
2. You can't stop the reactor multiple times, so you can't blindly call "crawler.signals.connect(reactor.stop, signal=signals.spider_closed)"
3. The reactor runs in its own thread, so failed assertions never reach the main unittest thread: they are raised as assertion errors, but unittest doesn't know about them

To get around these hurdles, I created a BaseScrapyTestCase class that uses tl.testing's ThreadAwareTestCase and the following workarounds.

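# Imports assume the Scrapy version current at the time of writing (0.24) and tl.testing
from scrapy import log, signals
from scrapy.crawler import Crawler
from scrapy.utils.project import get_project_settings
from tl.testing.thread import ThreadAwareTestCase, ThreadJoiner
from twisted.internet import reactor

class BaseScrapyTestCase(ThreadAwareTestCase):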
	in_suite = False
 
	def setUp(self):
		self.last_crawler = None
		self.settings = get_project_settings()
 
	def run_reactor(self, called_from_suite=False):
		if not called_from_suite and BaseScrapyTestCase.in_suite:
			return
		log.start()
		self.last_crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
		reactor.run()
 
	def queue_spider(self, spider, callback):
		crawler = Crawler(self.settings)
		self.last_crawler = crawler
		crawler.signals.connect(callback, signal=signals.spider_closed)
		crawler.configure()
		crawler.crawl(spider)
		crawler.start()
		return crawler
 
	def wrap_asserts(self, fn):
		with ThreadJoiner(1):
			self.run_in_thread(fn)

You'll use it like so:

class SimpleScrapyTestCase(BaseScrapyTestCase):
	def test_suite(self):
		BaseScrapyTestCase.in_suite = True
		self.do_test_simple()
		self.run_reactor(True)
 
	def do_test_simple(self):
		spider = Spider("site.com")
		def _fn():
			def __fn():
				self.assertTrue(False)
			self.wrap_asserts(__fn)
		self.queue_spider(spider, _fn)
		self.run_reactor()

1. Call run_reactor() at the end of each test method.
2. Place your assertions in their own function, which gets called inside a ThreadJoiner so that unittest knows about assertion failures.
3. If you're testing multiple spiders, just call queue_spider() for each one, and run_reactor() at the end.
4. BaseScrapyTestCase keeps track of the last crawler created, and attaches the reactor.stop signal only to that one.

Let me know if you come up with a better/more elegant way of testing scrapy spiders!

Non-blocking/NIO HTTP requests in Java with Jetty's HttpClient

Posted by Kelvin on 05 Mar 2012 | Tagged as: programming, crawling

Jetty 6/7 contain an HttpClient class that makes it uber-easy to issue non-blocking HTTP requests in Java. Here is a code snippet to get you started.

Initialize the HttpClient object.

    HttpClient client = new HttpClient();
    client.setConnectorType(HttpClient.CONNECTOR_SELECT_CHANNEL);
    client.setMaxConnectionsPerAddress(200); // max 200 concurrent connections to every address
    client.setTimeout(30000); // 30 seconds timeout; if no server reply, the request expires
    client.start();

Create a ContentExchange object which encapsulates the HTTP request/response interaction.

    ContentExchange exchange = new ContentExchange() {
      @Override protected void onResponseComplete() throws IOException {
        System.out.println(getResponseContent());
      }
    };
 
    exchange.setAddress(new Address("supermind.org", 80));
    exchange.setURL("http://www.supermind.org/index.html");
    client.send(exchange);

We override the onResponseComplete() method to print the response body to console.

By default, an asynchronous request is performed. To run the request synchronously, all you need to do is add the following line:

exchange.waitForDone();

HOWTO: Collect WebDriver HTTP Request and Response Headers

Posted by Kelvin on 22 Jun 2011 | Tagged as: programming, Lucene / Solr / Elasticsearch / Nutch, crawling

WebDriver is a fantastic Java API for web application testing. It has recently been merged into the Selenium project to provide a friendlier API for programmatic simulation of web browser actions. Its unique property is that it executes web pages in real web browsers such as Firefox, Chrome, IE, etc., and then gives you programmatic access to the DOM.

The problem with WebDriver, though, as reported here, is that the underlying browser implementation does the actual fetching (as opposed to, say, Commons HttpClient), so it's currently not possible to obtain the HTTP request and response headers, which is kind of a PITA.

I present here a method of obtaining HTTP request and response headers via an embedded proxy, derived from the Proxoid project.

ProxyLight from Proxoid

ProxyLight is the lightweight standalone proxy from the Proxoid project. It's released under the Apache Public License.

The original code only provided request filtering, and performed no response filtering, forwarding data directly from the web server to the requesting client.

I made some modifications to intercept and parse HTTP response headers.

Get my version here (released under APL): http://downloads.supermind.org/proxylight-20110622.zip

Using ProxyLight from WebDriver

The modified ProxyLight allows you to process both request and response.

This has the added benefit of allowing you to write a RequestFilter which ignores images, or URLs from certain domains. Sweet!

What your WebDriver code has to do then, is:

  1. Ensure the ProxyLight server is started
  2. Add Request and Response Filters to the ProxyLight server
  3. Maintain a cache of the intercepted responses, which you can then retrieve
  4. Ensure the native browser uses our ProxyLight server

Here's a sample class to get you started:

package org.supermind.webdriver;
 
import com.mba.proxylight.ProxyLight;
import com.mba.proxylight.Response;
import com.mba.proxylight.ResponseFilter;
import org.openqa.selenium.firefox.FirefoxDriver;
import org.openqa.selenium.firefox.FirefoxProfile;
 
import java.util.LinkedHashMap;
import java.util.Map;
 
public class SampleWebDriver {
  protected int localProxyPort = 5368;
  protected ProxyLight proxy;
 
  // Bounded response table: once it holds more than 100 entries, the oldest one is evicted.
  // Note: this is not thread-safe.
  // Use ConcurrentLinkedHashMap instead: http://code.google.com/p/concurrentlinkedhashmap/
  private LinkedHashMap<String, Response> responseTable = new LinkedHashMap<String, Response>() {
    @Override
    protected boolean removeEldestEntry(Map.Entry<String, Response> eldest) {
      return size() > 100;
    }
  };
 
 
  public Response fetch(String url) {
    if (proxy == null) {
      initProxy();
    }
    FirefoxProfile profile = new FirefoxProfile();
 
    /**
     * Get the native browser to use our proxy
     */
    profile.setPreference("network.proxy.type", 1);
    profile.setPreference("network.proxy.http", "localhost");
    profile.setPreference("network.proxy.http_port", localProxyPort);
 
    FirefoxDriver driver = new FirefoxDriver(profile);
 
    // Now fetch the URL
    driver.get(url);
 
    Response proxyResponse = responseTable.remove(driver.getCurrentUrl());
 
    return proxyResponse;
  }
 
  private void initProxy() {
    proxy = new ProxyLight();
 
    this.proxy.setPort(localProxyPort);
 
    // this response filter adds the intercepted response to the cache
    this.proxy.getResponseFilters().add(new ResponseFilter() {
      public void filter(Response response) {
        responseTable.put(response.getRequest().getUrl(), response);
      }
    });
 
    // add request filters here if needed
 
    // now start the proxy
    try {
      this.proxy.start();
    } catch (Exception e) {
      e.printStackTrace();
    }
  }
 
  public static void main(String[] args) {
    SampleWebDriver driver = new SampleWebDriver();
    Response res = driver.fetch("http://www.lucenetutorial.com");
    System.out.println(res.getHeaders());
  }
}

Solr 3.2 released!

Posted by Kelvin on 22 Jun 2011 | Tagged as: programming, Lucene / Solr / Elasticsearch / Nutch, crawling

I'm a little slow off the block here, but I just wanted to mention that Solr 3.2 has been released!

Get your download here: http://www.apache.org/dyn/closer.cgi/lucene/solr

Solr 3.2 release highlights include:

  • Ability to specify overwrite and commitWithin as request parameters when using the JSON update format
  • TermQParserPlugin, useful when generating filter queries from terms returned from field faceting or the terms component.
  • DebugComponent now supports using a NamedList to model Explanation objects in its responses instead of Explanation.toString
  • Improvements to the UIMA and Carrot2 integrations

I had personally been looking forward to the overwrite request param addition to JSON update format, so I'm delighted about this release.

Great work guys!

Recap: The Fallacies of Distributed Computing

Posted by Kelvin on 01 Mar 2011 | Tagged as: programming, Lucene / Solr / Elasticsearch / Nutch, crawling

Just so no-one forgets, here's a recap of the Fallacies of Distributed Computing:

1. The network is reliable.
2. Latency is zero.
3. Bandwidth is infinite.
4. The network is secure.
5. Topology doesn’t change.
6. There is one administrator.
7. Transport cost is zero.
8. The network is homogeneous.

[SOLVED] curl: (56) Received problem 2 in the chunky parser

Posted by Kelvin on 09 Oct 2010 | Tagged as: programming, crawling, PHP

The problem is described here:

http://curl.haxx.se/mail/lib-2006-04/0046.html

I successfully tracked the problem to the "Connection:" header. It seems that
if the "Connection: keep-alive" request header is not sent the server will
respond with data which is not chunked . It will still reply with a
"Transfer-Encoding: chunked" response header though.
I don't think this behavior is normal and it is not a cURL problem. I'll
consider the case closed but if somebody wants to make something about it I
can send additional info and test it further.

The workaround is simple: have curl use HTTP version 1.0 instead of 1.1.

In PHP, add this:

curl_setopt($curl_handle, CURLOPT_HTTP_VERSION, CURL_HTTP_VERSION_1_0 );
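To see why a parser chokes when a server announces "Transfer-Encoding: chunked" but sends a plain body, here is a minimal sketch of a chunked-body decoder in Java. It is not curl's code: the class name ChunkedBodyDemo and its helpers are made up for illustration, and trailers are ignored. The point is that the parser expects a hex size line first, and a plain body fails that check right away. Forcing HTTP 1.0 sidesteps the issue because chunked transfer encoding only exists from HTTP 1.1 onwards.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: not curl's parser, just the same idea in miniature.
class ChunkedBodyDemo {

    // Decodes a body framed as "<hex size>\r\n<data>\r\n ... 0\r\n\r\n".
    static byte[] decode(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while (true) {
            String sizeLine = readLine(in);
            int semi = sizeLine.indexOf(';'); // drop optional chunk extensions
            String hex = (semi >= 0 ? sizeLine.substring(0, semi) : sizeLine).trim();
            int size;
            try {
                size = Integer.parseInt(hex, 16);
            } catch (NumberFormatException e) {
                throw new IOException("bad chunk size line: '" + sizeLine + "'");
            }
            if (size < 0) {
                throw new IOException("negative chunk size: " + size);
            }
            if (size == 0) {
                return out.toByteArray(); // last chunk; trailers are ignored here
            }
            byte[] chunk = new byte[size];
            int off = 0;
            while (off < size) {
                int n = in.read(chunk, off, size - off);
                if (n < 0) throw new IOException("stream ended inside a chunk");
                off += n;
            }
            out.write(chunk, 0, size);
            readLine(in); // consume the CRLF that follows the chunk data
        }
    }

    // Reads bytes up to LF (or end of stream) and returns them without CR/LF.
    static String readLine(InputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = in.read()) != -1 && c != '\n') {
            if (c != '\r') sb.append((char) c);
        }
        return sb.toString();
    }

    // Test helper: wraps an ASCII string as a stream.
    static InputStream ascii(String s) {
        return new ByteArrayInputStream(s.getBytes(StandardCharsets.US_ASCII));
    }

    public static void main(String[] args) throws IOException {
        // A correctly chunked body decodes fine.
        byte[] body = decode(ascii("5\r\nhello\r\n6\r\n world\r\n0\r\n\r\n"));
        System.out.println(new String(body, StandardCharsets.US_ASCII));
        // A plain body sent under a "chunked" header fails on the first size line.
        try {
            decode(ascii("<html>not chunked at all</html>"));
        } catch (IOException e) {
            System.out.println("parser gave up: " + e.getMessage());
        }
    }
}
```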

java.net.URL synchronization bottleneck

Posted by Kelvin on 08 Dec 2009 | Tagged as: programming, crawling

This is interesting because I haven't found anything on Google about it.

There's a static Hashtable in java.net.URL (urlStreamHandlers) which gets accessed on every constructor call. Well, when you're running a crawler with, say, 50 threads, that turns out to be a major bottleneck.

Of the 70 threads I had running, 48 were blocked in the java.net.URL constructor. I was using the URL class for resolving relative URLs to absolute ones.

Since I had previously written a URL parser to parse out the parts of a URL, I went ahead and implemented my own URL resolution function.

Went from

Status: 12.407448 pages/s, 207.06316 kb/s, 2136.143 bytes/page

to

Status: 43.9947 pages/s, 557.29156 kb/s, 1621.4071 bytes/page

after increasing the number of threads to 100 (which would not have made much difference with the java.net.URL implementation).

Cool stuff.
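If you don't have your own URL parser lying around, one possible alternative (not what I did above, just a sketch) is to resolve links with java.net.URI, which parses strings and never looks up a protocol handler. The UrlResolver class below is a made-up name. Be aware that URI is stricter than browsers about illegal characters such as spaces, so real crawl data may need cleaning up first.

```java
import java.net.URI;

// Hypothetical helper; an alternative to a hand-written resolver, not the one described above.
class UrlResolver {

    // Resolves a possibly-relative link against the URL of the page it was found on.
    // Throws IllegalArgumentException if either string is not a valid URI.
    static String resolve(String baseUrl, String link) {
        return URI.create(baseUrl).resolve(link).toString();
    }

    public static void main(String[] args) {
        String base = "http://example.com/a/b/c.html";
        System.out.println(resolve(base, "../d.html"));          // http://example.com/a/d.html
        System.out.println(resolve(base, "e.html?x=1"));         // http://example.com/a/b/e.html?x=1
        System.out.println(resolve(base, "/top.html"));          // http://example.com/top.html
        System.out.println(resolve(base, "http://other.org/y")); // http://other.org/y
    }
}
```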

Average length of a URL

Posted by Kelvin on 06 Nov 2009 | Tagged as: programming, Lucene / Solr / Elasticsearch / Nutch, crawling

Aug 16 update: I ran a more comprehensive analysis with a more complete dataset. Find out the new figures for the average length of a URL

I've always been curious what the average length of a URL is, mostly when approximating memory requirements of storing URLs in RAM.

Well, I did a dump of the DMOZ URLs, sorted and uniq-ed the list of URLs.

Ended up with 4074300 unique URLs weighing in at 139406406 bytes, which approximates to 34 characters per URL.
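For anyone who wants to redo the arithmetic or plug in their own numbers, here is a small sketch. The class and method names are made up, and the estimate covers only the raw URL bytes, not the per-object overhead of holding them as Java Strings.

```java
import java.util.Locale;

// Hypothetical helper for back-of-the-envelope sizing.
class UrlMemoryEstimate {

    // Average URL length in bytes.
    static double averageLength(long totalBytes, long urlCount) {
        return (double) totalBytes / urlCount;
    }

    // Raw bytes needed for urlCount URLs of the given average length.
    // Ignores JVM object headers, char encoding and collection overhead.
    static long rawBytesFor(long urlCount, double avgLength) {
        return Math.round(urlCount * avgLength);
    }

    public static void main(String[] args) {
        double avg = averageLength(139_406_406L, 4_074_300L);
        System.out.println(String.format(Locale.ROOT, "average: %.2f bytes/URL", avg)); // about 34.22
        System.out.println("raw bytes for 100 million URLs: " + rawBytesFor(100_000_000L, avg));
    }
}
```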

Is Nutch appropriate for aggregation-type vertical search?

Posted by Kelvin on 24 Sep 2007 | Tagged as: programming, Lucene / Solr / Elasticsearch / Nutch, crawling

I get pinged all the time by people who tell me they want to build a vertical search engine with Nutch. The part I can't figure out, though, is why Nutch?

What's vertical anyway?

So let's start from basics. Vertical search engines typically fall into 2 categories:

  1. Whole-web search engines which selectively crawl the Internet for webpages related to a certain topic/industry/etc.
  2. Aggregation-type search engines which mine other websites and databases, aggregating data and repackaging it into a format which is easier to search.

Now, imagine a biotech company comes to me to develop a search engine for everything related to biotechnology and genetics. You'd have to crawl as many websites as you can, and only include the ones related to biotechnology in the search index.

How would I implement the crawler? Probably use Nutch for the crawling and modify it to only extract links from a page if the page contents are relevant to biotechnology. I'd probably need to write some kind of relevancy scoring function which uses a mixture of keywords, ontology and some kind of similarity detection based on sites we know a priori to be relevant.
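As a rough illustration of the keyword part of such a scoring function (and only that part: no ontology, no similarity detection), here is a sketch. TopicFilter, its threshold and the sample keywords are all assumptions for the example, and none of this is Nutch API. The idea is that the crawler asks shouldFollowLinks() before extracting outlinks from a page.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical keyword-based relevance filter; not part of Nutch.
class TopicFilter {
    private final Set<String> keywords; // lower-case topic terms
    private final double threshold;     // minimum score for a page to count as on-topic

    TopicFilter(Set<String> keywords, double threshold) {
        this.keywords = keywords;
        this.threshold = threshold;
    }

    // Fraction of the topic keywords that occur at least once in the page text.
    double score(String pageText) {
        if (keywords.isEmpty()) {
            return 0.0;
        }
        String[] tokens = pageText.toLowerCase(Locale.ROOT).split("[^a-z0-9]+");
        Set<String> words = new HashSet<>(Arrays.asList(tokens));
        int hits = 0;
        for (String keyword : keywords) {
            if (words.contains(keyword)) {
                hits++;
            }
        }
        return (double) hits / keywords.size();
    }

    // Only extract links from pages that look relevant to the topic.
    boolean shouldFollowLinks(String pageText) {
        return score(pageText) >= threshold;
    }

    public static void main(String[] args) {
        Set<String> biotech = new HashSet<>(Arrays.asList("gene", "protein", "genome", "enzyme"));
        TopicFilter filter = new TopicFilter(biotech, 0.25);
        System.out.println(filter.shouldFollowLinks("The genome project maps every gene."));
        System.out.println(filter.shouldFollowLinks("Cheap flights to Paris"));
    }
}
```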

Now, second scenario. Imagine someone comes to me and wants to develop a job search engine for a certain country. This would involve indexing all jobs posted in the 4 major job websites, refreshing this database on a daily basis, checking for new jobs, deleting expired jobs etc.

How would I implement this second crawler? Use Nutch? No way! Ahhhh, now we're getting to the crux of this post.

The ubiquity of Lucene … and therefore Nutch

Nutch is one of two open-source Java crawlers out there, the other being Heritrix from the good guys at the Internet Archive. It has ridden on Lucene's status as the default choice of full-text search API. Everyone who wants to build a vertical search engine in Java these days knows they're going to use Lucene as the search API, and naturally looks to Nutch for the crawling side of things. And that's when their project runs into a brick wall…

To Nutch or not to Nutch

Nutch (and Hadoop) is a very very cool project with ambitious and praiseworthy goals. They're really trying to build an open-source version of Google (not sure if that actually is the explicitly declared aim).

Before jumping into any library or framework, you want to be sure you know what needs to be accomplished. I think this is the step many people skip: they have no idea what crawling is all about, so they try to learn what crawling is by observing what a crawler does. Enter Nutch.

The trouble is, observing/using Nutch isn't necessarily the best way to learn about crawling. The best way to learn about crawling is to build a simple crawler.

In fact, if you sit down and think about what a 4 job-site crawler really needs to do, it's not difficult to see that its functionality is modest and humble. So modest that I can write its algorithm out here:


for each site:
  if there is a way to list all jobs in the site, then
    page through this list, extracting job detail urls to the detail url database
  else if exists browseable categories like industry or geographical location, then
    page through these categories, extracting job detail urls to the detail url database
  else 
    continue
  for each url in the detail url database:
    download the url
    extract data into a database table according to predefined regex patterns

Won't be difficult to hack up something quick to do this, especially with the help of Commons HttpClient. You'll probably also want to make this app multi-threaded.

Other things you'll want to consider are how many simultaneous threads to hit a server with, whether to save the HTML content of pages or just keep the extracted data, how to deal with errors, etc.
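To make the algorithm above a bit more concrete, here is a single-threaded sketch of the two steps in Java. Everything site-specific is an assumption: the regex patterns, the URL shapes (/jobs?page=N, /job/N) and the class name JobSiteCrawler are invented, and the fetcher is just a function from URL to HTML so the example runs against an in-memory fake site. In a real crawler you'd plug in an HTTP client there and write to a database instead of a list.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical crawler for one job site; patterns and URL layout are made up.
class JobSiteCrawler {
    static final Pattern DETAIL_LINK = Pattern.compile("href=\"(/job/\\d+)\"");
    static final Pattern NEXT_PAGE = Pattern.compile("href=\"(/jobs\\?page=\\d+)\"[^>]*>next<");
    static final Pattern TITLE = Pattern.compile("<h1>(.*?)</h1>");

    private final Function<String, String> fetcher; // url -> html, null if unavailable

    JobSiteCrawler(Function<String, String> fetcher) {
        this.fetcher = fetcher;
    }

    // Step 1: page through the job listing, collecting detail URLs.
    Set<String> collectDetailUrls(String firstListingUrl) {
        Set<String> detailUrls = new LinkedHashSet<>();
        Set<String> seenListings = new HashSet<>(); // guards against paging loops
        String next = firstListingUrl;
        while (next != null && seenListings.add(next)) {
            String html = fetcher.apply(next);
            if (html == null) {
                break;
            }
            Matcher links = DETAIL_LINK.matcher(html);
            while (links.find()) {
                detailUrls.add(links.group(1));
            }
            Matcher nextLink = NEXT_PAGE.matcher(html);
            next = nextLink.find() ? nextLink.group(1) : null;
        }
        return detailUrls;
    }

    // Step 2: download each detail page and extract data by regex.
    List<String> extractTitles(Set<String> detailUrls) {
        List<String> titles = new ArrayList<>();
        for (String url : detailUrls) {
            String html = fetcher.apply(url);
            if (html == null) {
                continue; // skip pages that failed to download
            }
            Matcher m = TITLE.matcher(html);
            if (m.find()) {
                titles.add(m.group(1).trim());
            }
        }
        return titles;
    }

    // A tiny in-memory site so the sketch runs without a network.
    static Map<String, String> demoSite() {
        Map<String, String> site = new HashMap<>();
        site.put("/jobs?page=1", "<a href=\"/job/1\">a</a><a href=\"/job/2\">b</a>"
                + "<a href=\"/jobs?page=2\">next</a>");
        site.put("/jobs?page=2", "<a href=\"/job/3\">c</a>");
        site.put("/job/1", "<h1>Java Developer</h1>");
        site.put("/job/2", "<h1>Search Engineer</h1>");
        site.put("/job/3", "<h1>Data Analyst</h1>");
        return site;
    }

    public static void main(String[] args) {
        JobSiteCrawler crawler = new JobSiteCrawler(demoSite()::get);
        Set<String> urls = crawler.collectDetailUrls("/jobs?page=1");
        System.out.println(urls);
        System.out.println(crawler.extractTitles(urls));
    }
}
```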

All in all, I think you'll find that it's not altogether overwhelming, and there's actually a lot to be said for the complete control you have over the crawling and post-crawl extraction processes. Compare this to Nutch, where you'll need to fiddle with various configuration files (nutch-site.xml, urlfilters, etc.); calling it as an API from your own app is difficult; you'll have to work with the various file I/O structures to reach the content (SegmentFile, MapFile, etc.); various issues may prevent all URLs from being fetched (retry.max being a common one); and if you want custom crawl logic, you'll have to patch or fork the codebase (ugh!).

The other thing that Nutch offers is an out-of-box search solution, but I personally have never found a compelling reason to use it: it's difficult to add custom fields, adding OR phrase capability requires patching the codebase, etc. In fact, I find it much much simpler to come up with my own SearchServlet.

Even if you decide not to come up with a homegrown solution and you want to go with Nutch, here's one other thing you need to know before jumping in.

To map-reduce, or not?

From Nutch 0.7 to Nutch 0.8, there was a pretty big jump in the code complexity with the inclusion of the map-reduce infrastructure. Map-reduce subsequently got factored out, together with some of the core distributed I/O classes into Hadoop.

For a simple example to illustrate my point, just compare the core crawler class, org.apache.nutch.fetcher.Fetcher, in the 0.7 branch with the one in the current 0.9 branch.

The 0.7 Fetcher is simple and easy to understand. I can't say the same of the 0.9 Fetcher. Even after having worked a bit with the 0.9 fetcher and map-reduce, I still find myself having to do mental gymnastics to figure out what's going on. BUT THAT'S OK, because writing massively distributable, scalable yet reliable applications is very very hard, and map-reduce makes this possible and comparatively easy. The question to ask, though, is: does your search engine project to crawl and search those 4 job sites fall into this category? If not, you'd want to seriously consider not using the latest 0.8x release of Nutch, and lean toward 0.7 instead. Of course, the biggest problem with this is that 0.7 is not being actively maintained (to my knowledge).

Conclusion

Perhaps someone will read this post and think I'm slighting Nutch, so let me make this really clear: _for what it's designed to do_, that is, whole-web crawling, Nutch does a good job; if what is needed is to page through search result pages and extract data into a database, Nutch is simply overkill.

A simple API-friendly crawler

Posted by Kelvin on 01 Dec 2006 | Tagged as: Lucene / Solr / Elasticsearch / Nutch, crawling, programming

Alright. I know I've blogged about this before. Well, I'm revisiting it.

My sense is that there's a real need for a simple crawler which is easy to use as an API and doesn't attempt to be everything to everyone.

Yes, Nutch is cool, but I'm so tired of fiddling around with configuration files, the proprietary file formats, and the filesystem-dependence of plugins. Also, crawl progress reporting is poor unless you intend to parse log files.

Here are some thoughts on what a simple crawler might look like:

Download all pages in a site


    SimpleCrawler c = new SimpleCrawler();
    c.addURL(url);
    c.setOutput(new SaveToDisk(downloaddir));
    c.setProgressListener(new StdOutProgressListener());
    c.setScope(new HostScope(url));
    c.start();
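To give an idea of how small the pluggable pieces could be, here is one guess at what HostScope might look like behind that API. The Scope interface and its inScope() method are my assumptions for the sketch; nothing here exists yet.

```java
import java.net.URI;
import java.util.Locale;

// Hypothetical interface: decides whether a discovered URL should be crawled.
interface Scope {
    boolean inScope(String url);
}

// Keeps the crawl on the same host as the seed URL.
class HostScope implements Scope {
    private final String host; // null if the seed URL had no parseable host

    HostScope(String seedUrl) {
        this.host = hostOf(seedUrl);
    }

    @Override
    public boolean inScope(String url) {
        String candidate = hostOf(url);
        return candidate != null && candidate.equals(host);
    }

    // Returns the lower-cased host, or null if the URL can't be parsed.
    private static String hostOf(String url) {
        try {
            String h = URI.create(url).getHost();
            return h == null ? null : h.toLowerCase(Locale.ROOT);
        } catch (IllegalArgumentException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        Scope scope = new HostScope("http://www.supermind.org/index.html");
        System.out.println(scope.inScope("http://www.supermind.org/blog/")); // true
        System.out.println(scope.inScope("http://example.com/"));            // false
    }
}
```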

Download all urls from a file (depth 1 crawl)


    SimpleCrawler c = new SimpleCrawler();
    c.setMaxConnectionsPerHost(5);
    c.setIntervalBetweenConsecutiveRequests(1000);
    c.addURLs(new File(file));
    c.setLinkExtractor(null);
    c.setOutput(new DirectoryPerDomain(downloaddir));
    c.setProgressListener(new StdOutProgressListener());
    c.start();

Page through a search results page via regex


    SimpleCrawler c = new SimpleCrawler();
    c.addURL(url);
    c.setLinkExtractor(new RegexLinkExtractor(regex));
    c.setOutput(new SaveToDisk(downloaddir));
    c.setProgressListener(new StdOutProgressListener());
    c.start();
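Similarly, a RegexLinkExtractor could be little more than a loop over regex matches. This is only a sketch of one possible shape: the extract() method and the convention of returning capture group 1 are assumptions, and the sample regex for "page=N" links is invented.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical link extractor: returns whatever the regex captures from the page.
class RegexLinkExtractor {
    private final Pattern pattern;

    RegexLinkExtractor(String regex) {
        this.pattern = Pattern.compile(regex);
    }

    // Returns capture group 1 of every match, or the whole match if the regex has no groups.
    List<String> extract(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = pattern.matcher(html);
        while (m.find()) {
            links.add(m.groupCount() >= 1 ? m.group(1) : m.group());
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<a href=\"/search?q=solr&page=2\">2</a> "
                + "<a href=\"/about\">About</a> "
                + "<a href=\"/search?q=solr&page=3\">3</a>";
        // Only follow the paging links, not the rest of the navigation.
        RegexLinkExtractor extractor = new RegexLinkExtractor("href=\"([^\"]*[?&]page=\\d+)\"");
        System.out.println(extractor.extract(html));
    }
}
```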

Save to nutch segment for compatibility


    SimpleCrawler c = new SimpleCrawler();
    c.addURL(url);
    c.setOutput(new NutchSegmentOutput(segmentdir));
    c.setProgressListener(new StdOutProgressListener());
    c.start();

I'm basically trying to find the sweet-spot between Commons HttpClient, and a full-blown crawler app like Nutch.

Thoughts?
