Supermind Search Consulting Blog 
Solr - Elasticsearch - Big Data

Posts about programming

Spam, moderation and life..

Posted by Kelvin on 01 Jan 2007 | Tagged as: programming

Just spent the last 2 hours cleaning up my spam-filled moderation queue, and have enabled Spam Karma (http://unknowngenius.com/blog/wordpress/spam-karma/).

Apologies to anyone whose comments never made it public!

PHP + Lucene integration

Posted by Kelvin on 01 Jan 2007 | Tagged as: programming, Lucene / Solr / Elasticsearch / Nutch

I've had very positive experiences integrating a PHP front-end with a Lucene back-end, not with any fancy Java-in-PHP wizardry or Zend Lucene, but with plain old JSON-over-REST.

I've done some simple load tests, and it clearly outperforms Zend Lucene (though I can't offer you any numbers to back this claim up). Zend Lucene seems to suck badly at large (1GB+) indexes.

In terms of implementation, a servlet takes a query parameter, performs the search, then returns the results as JSON objects. There is a well-defined API, and the solution offers the full expressiveness of Lucene's query mechanism.
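
Here's a rough sketch of what such a servlet might look like, assuming Lucene 2.x-era APIs and a servlet container; the index path, the field names ("contents", "title") and the JSON shape are illustrative only, and JSON escaping is omitted. This is not the actual code used on this site.

    import java.io.IOException;
    import javax.servlet.http.*;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.*;

    // Minimal sketch of JSON-over-REST search: parse the "q" parameter,
    // run it against a Lucene index, write the top hits back as JSON.
    public class SearchServlet extends HttpServlet {
        protected void doGet(HttpServletRequest req, HttpServletResponse res)
                throws IOException {
            try {
                IndexSearcher searcher = new IndexSearcher("/path/to/index");
                Query query = new QueryParser("contents", new StandardAnalyzer())
                        .parse(req.getParameter("q"));
                Hits hits = searcher.search(query);

                StringBuilder json = new StringBuilder("{\"total\":" + hits.length() + ",\"hits\":[");
                int n = Math.min(hits.length(), 10);
                for (int i = 0; i < n; i++) {
                    if (i > 0) json.append(',');
                    json.append("{\"title\":\"").append(hits.doc(i).get("title")).append("\"}");
                }
                json.append("]}");

                res.setContentType("application/json");
                res.getWriter().write(json.toString());
                searcher.close();
            } catch (Exception e) {
                res.sendError(500, e.getMessage());
            }
        }
    }

The PHP side then just needs an HTTP GET and a JSON decode, which is why this setup stays so simple.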

With a little creativity, you can even perform stuff like facet counting (see the left sidebar).

By abstracting the search presentation layer (JSON in this case), it would be a matter of minutes to add an XML or HTML display on top of it.

A simple API-friendly crawler

Posted by Kelvin on 01 Dec 2006 | Tagged as: programming, Lucene / Solr / Elasticsearch / Nutch, crawling

Alright. I know I've blogged about this before. Well, I'm revisiting it again.

My sense is that there's a real need for a simple crawler which is easy to use as an API and doesn't attempt to be everything to everyone.

Yes, Nutch is cool, but I'm so tired of fiddling around with configuration files, the proprietary file formats, and the filesystem dependence of plugins. Also, crawl progress reporting is poor unless you intend to parse log files.

Here are some thoughts on what a simple crawler might look like:

Download all pages in a site


    SimpleCrawler c = new SimpleCrawler();
    c.addURL(url);
    c.setOutput(new SaveToDisk(downloaddir));
    c.setProgressListener(new StdOutProgressListener());
    c.setScope(new HostScope(url));
    c.start();

Download all urls from a file (depth 1 crawl)


    SimpleCrawler c = new SimpleCrawler();
    c.setMaxConnectionsPerHost(5);
    c.setIntervalBetweenConsecutiveRequests(1000);
    c.addURLs(new File(file));
    c.setLinkExtractor(null); // no link extraction, so only the listed URLs are fetched (depth 1)
    c.setOutput(new DirectoryPerDomain(downloaddir));
    c.setProgressListener(new StdOutProgressListener());
    c.start();

Page through a search results page via regex


    SimpleCrawler c = new SimpleCrawler();
    c.addURL(url);
    c.setLinkExtractor(new RegexLinkExtractor(regex));
    c.setOutput(new SaveToDisk(downloaddir));
    c.setProgressListener(new StdOutProgressListener());
    c.start();

Save to a Nutch segment for compatibility


    SimpleCrawler c = new SimpleCrawler();
    c.addURL(url);
    c.setOutput(new NutchSegmentOutput(segmentdir));
    c.setProgressListener(new StdOutProgressListener());
    c.start();

I'm basically trying to find the sweet spot between Commons HttpClient and a full-blown crawler app like Nutch.

Thoughts?

Search and crawling internship

Posted by Kelvin on 31 Oct 2006 | Tagged as: programming, Lucene / Solr / Elasticsearch / Nutch, crawling

I'm looking for a competent Java programmer who wants to get into the world of search engines and crawlers. The internship will involve a mixture of (some) training and (mostly) hands-on projects with real-world clients.

In particular, my area of expertise is in vertical search (real estate, jobs, classifieds), so more than likely, that will be the bulk of the exposure.

You'll be working mostly with Lucene and Nutch, but more than just using them, you'll be delving into their innards and pushing the limits of these libraries.

Upon graduation, if you're interested, you'll be invited to stay on and work alongside me on contracting projects, some of which might involve on-site work, perhaps overseas.

The background to this is that I'm basically tired of turning down consulting work because of my limited time, and want to start building a search/crawling consulting practice.

So if you're interested, send your resume along to internship AT supermind.org, with a brief note on your background and interest in this field.

Normalized Google Distance

Posted by Kelvin on 30 Oct 2006 | Tagged as: programming

http://blog.outer-court.com/archive/2005-01-27-n48.html has an interesting article on Normalized Google Distance.

In short, it uses Google page counts to determine the semantic distance/similarity between two words.

I unknowingly used this in a recent project where we were attempting to detect the sentiment of blogs: for example, is a blog post positively or negatively slanted towards a movie?

The general idea was to come up with lists of positive and negative words, then extract adjectives from the blog post and run them side-by-side against the positive and negative word lists via Google queries.

For example, given a blog post containing the adjectives "dry", "witty" and "obfuscated", and without a priori knowledge or a database of +ve or -ve adjectives, we would run Google queries pairing these adjectives with +ve words like "happy", "bright", "light" and "funny", and with -ve words like "unhappy", "sad" and "troubled", and see which adjectives have a higher correlation with the respective +ve and -ve words.
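
For reference, the measure described in the linked article boils down to comparing page counts for two terms alone and together. Here's a small sketch of the Normalized Google Distance formula, where hitCount() is a hypothetical stand-in for querying a search API and N is an assumed estimate of the total number of indexed pages:

    // A minimal sketch of Normalized Google Distance (NGD).
    // hitCount() is hypothetical -- in practice it would call a search API and
    // parse the "results found" figure; N is a rough index-size assumption.
    public class Ngd {
        static final double N = 1e10; // assumed total number of indexed pages

        static double ngd(String x, String y) {
            double fx = Math.log(hitCount(x));
            double fy = Math.log(hitCount(y));
            double fxy = Math.log(hitCount(x + " " + y)); // pages containing both terms
            return (Math.max(fx, fy) - fxy) / (Math.log(N) - Math.min(fx, fy));
        }

        static long hitCount(String query) {
            // placeholder: issue the query to a search engine and return its hit count
            throw new UnsupportedOperationException("wire up to a search API");
        }

        public static void main(String[] args) {
            // smaller NGD => the terms co-occur more often => semantically closer,
            // so an adjective like "witty" should score closer to "funny" than to "sad"
            System.out.println(ngd("witty", "funny"));
        }
    }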

How's that for AI?

Nutch 0.8, Map & Reduce, here I come!

Posted by Kelvin on 09 Aug 2006 | Tagged as: programming, Lucene / Solr / Elasticsearch / Nutch, crawling

Finally taking the plunge to Nutch 0.8 after working exclusively with 0.7 for over a year (and something like 5 projects).

From initial experiences, it appears that using M&R does obfuscate the code somewhat for a developer who wants to build an app off the Nutch infrastructure instead of using it out of the box. For example, trying to decipher what's going on in org.apache.nutch.indexer.Indexer is pretty difficult compared to its 0.7 counterpart (IndexSegment).

Some serious documentation needs to be done around the implementation details of M&R. I'll keep posting about my learnings..

OC and Nutch MapReduce

Posted by Kelvin on 15 Sep 2005 | Tagged as: work, programming, Lucene / Solr / Elasticsearch / Nutch, crawling

(or what's next for OC)…

I've received a couple of emails about the future of OC vis-à-vis incorporation into the Nutch codebase, the upcoming MapReduce merge into trunk, etc.

My thoughts are:

  1. When MapReduce is merged into trunk, I'll make appropriate changes to OC to support MapReduce.
  2. This MapReduce-compatible OC will be offered to the Nutch codebase. As of today, I've removed most usages of JDK 1.5 features, so the only other thing that needs to be removed is the Spring Framework dependencies.
  3. I _might_ keep a version of OC which is more experimental and uses Spring, and can operate on a more standalone basis. The advantages (more autonomy over what I can and can't do, Spring support) will have to be balanced against the disadvantages (duplication).

Edit (27 Sep 2005):
I'm browsing the MapReduce sources now, and I've found that the complexity of the Fetcher has increased rather significantly. I'm leaning towards maintaining a simple non-mapred fetcher for folks who don't need MapReduce's scalability.

Inside Our Crawler

Posted by Kelvin on 25 Aug 2005 | Tagged as: work, programming, Lucene / Solr / Elasticsearch / Nutch, crawling

Inspired by http://wiki.apache.org/nutch/DissectingTheNutchCrawler

I guess the contents of this post will eventually make it to javadocs. Note that Our Crawler (OC) is really not so different from Nutch Crawler (NC). This document highlights the main differences, as well as important classes.

CrawlTool
CrawlTool is the point of entry to OC. It doesn't do very much, really; it just calls Fetcher.run().

Fetcher
The Fetcher doesn't do very much either. It starts the FetcherThreads and provides them with SeedURLs from a CrawlSeedSource. Its two other main responsibilities are distributing URLs amongst threads and reporting FetcherStatus. In a minor design quirk, the PostFetchProcessor lives in Fetcher rather than in the FetcherThread instances, which imposes the requirement that PostFetchProcessors be thread-safe. The rationale behind this is to be able to write to a single Nutch segment instead of requiring an x-way post-fetch segment merge, where x is the number of fetcher threads. (I haven't put much thought into this; if post-fetch processing is substantial, it makes more sense to make this a per-thread thing.)
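
The URL distribution strategy itself is currently just a hash of the host (see the Limitations of OC post further down), so every URL from a given host lands on the same thread. A sketch, with illustrative class and method names:

    import java.net.MalformedURLException;
    import java.net.URL;

    // Illustrative sketch of the host-to-thread contract: every URL from the same
    // host maps to the same FetcherThread, so each thread can keep its own
    // fetchlist and fetched-URL db without cross-thread coordination.
    public class UrlDistribution {
        static int threadIndexFor(URL url, int numberOfThreads) {
            // Math.abs guards against negative hashCode values
            return Math.abs(url.getHost().hashCode()) % numberOfThreads;
        }

        public static void main(String[] args) throws MalformedURLException {
            URL a = new URL("http://example.com/page1");
            URL b = new URL("http://example.com/page2");
            // both URLs land on the same thread because they share a host
            System.out.println(threadIndexFor(a, 20) == threadIndexFor(b, 20)); // true
        }
    }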

FetcherThread
A FetcherThread has the following high-level responsibilities:

  1. Maintain a FetchList and a db of fetched URLs for a set of URLs (as determined by the Fetcher's URL distribution strategy). FetcherThread is also responsible for avoiding duplicate URLs (whether fetched, parsed or output)
  2. Delegate the actual URL downloading to Http and HttpResponse classes
  3. Process the outcome of downloading, which primarily has the following steps:
    1. Parse the HTML
    2. Extract links from the parsed page and run each through FetchListScope, adding allowed links to the relevant thread's link queue (for subsequent addition to its fetchlist)
    3. Run the fetch output through PostFetchScope, passing it to the PostFetchProcessor if allowed

FetcherThread periodically moves items from its link queue to its fetchlist. The link queue is the thread-safe holding area where other fetcher threads add URLs to be fetched.
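
For illustration, the handoff might look something like this sketch, using a java.util.concurrent queue; the actual OC data structures and class names may differ.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ConcurrentLinkedQueue;

    // Illustrative sketch of the link-queue handoff: other threads enqueue links
    // concurrently, and the owning FetcherThread periodically drains the queue
    // into its private fetchlist.
    public class LinkQueueSketch {
        private final ConcurrentLinkedQueue<String> linkQueue = new ConcurrentLinkedQueue<String>();
        private final List<String> fetchlist = new ArrayList<String>();

        // called by other fetcher threads when FetchListScope allows a link
        public void addLink(String url) {
            linkQueue.offer(url);
        }

        // called periodically by the owning thread between fetches
        public void drainLinkQueueIntoFetchlist() {
            String url;
            while ((url = linkQueue.poll()) != null) {
                fetchlist.add(url);
            }
        }
    }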

A note about the PostFetchProcessor: I know it sounds weird :-).
Initially I called it something like PageOutputter, which leaves an equally bad taste in my mouth. I wanted to capture the idea of something that happens _after_ a URL/page is downloaded, whether that's saving to a Nutch segment, sending someone an email every 100 downloaded pages or triggering a backup.
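
The shape of the abstraction is roughly an interface like this; the signature below is illustrative, not the actual OC one.

    // Illustrative only -- not the actual OC signature. The point is that anything
    // can happen post-fetch: write to a Nutch segment, send a mail every 100
    // downloaded pages, trigger a backup, etc. Implementations must be
    // thread-safe because a single instance is shared by all FetcherThreads.
    public interface PostFetchProcessor {
        void process(String url, byte[] content) throws Exception;
    }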

FetchList
Now on to the fetchlist system, one of the biggest differences between NC and OC. The main actors are FetchList and HostQueue. Unlike Nutch, where a FetchList is a sequence of URLs, our FetchList is a sequence of HostQueues, each of which is in turn a sequence of URLs from the same host. The FetchList also manages the server politeness policy (unlike Nutch, where this is done in the Http class).

Different implementations of FetchList may choose different strategies for prioritizing certain hosts/URLs over others. This process can even be randomized (in which case it somewhat simulates the Nutch fetchlist, although Nutch's fetchlist is deterministic and OC's DefaultFetchList is NOT).
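
As an illustration of the structure (not the actual OC classes), a fetchlist built from per-host queues with a naive politeness check might look like this:

    import java.util.HashMap;
    import java.util.LinkedList;
    import java.util.Map;
    import java.util.Queue;

    // Illustrative sketch: a fetchlist as a collection of per-host queues rather
    // than a flat list of URLs, which is also where the politeness policy can live.
    public class HostQueueFetchListSketch {
        private final Map<String, Queue<String>> hostQueues = new HashMap<String, Queue<String>>();
        private final Map<String, Long> nextAllowedFetchTime = new HashMap<String, Long>();
        private final long politenessIntervalMillis = 1000; // assumed per-host delay

        public void add(String host, String url) {
            Queue<String> q = hostQueues.get(host);
            if (q == null) {
                q = new LinkedList<String>();
                hostQueues.put(host, q);
            }
            q.add(url);
        }

        // return the next URL whose host may currently be hit, or null if none is eligible
        public String next() {
            long now = System.currentTimeMillis();
            for (Map.Entry<String, Queue<String>> e : hostQueues.entrySet()) {
                Long notBefore = nextAllowedFetchTime.get(e.getKey());
                if ((notBefore == null || now >= notBefore) && !e.getValue().isEmpty()) {
                    nextAllowedFetchTime.put(e.getKey(), now + politenessIntervalMillis);
                    return e.getValue().poll();
                }
            }
            return null;
        }
    }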

LastModifiedDB and FetchedURLs
These are both new to OC, although the WebDB is roughly their equivalent in Nutch. The javadocs should be sufficient for understanding the roles of these two classes. One thing to note about FetchedURLs: to save space, the current implementation does _not_ save the actual URL, but rather a 64-bit checksum of it.
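
For illustration, a 64-bit checksum could be derived along these lines; this is a sketch, not necessarily how FetchedURLs actually computes it.

    import java.security.MessageDigest;

    // Illustrative sketch: store a 64-bit fingerprint of each fetched URL instead
    // of the URL itself. Collisions are possible but rare at 64 bits.
    public class UrlChecksum {
        public static long checksum(String url) throws Exception {
            byte[] digest = MessageDigest.getInstance("MD5").digest(url.getBytes("UTF-8"));
            long value = 0;
            for (int i = 0; i < 8; i++) {
                value = (value << 8) | (digest[i] & 0xff); // fold the first 8 bytes into a long
            }
            return value;
        }
    }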

Scopes and Filters
A Scope consists of zero or more filters; given an input, each filter replies ALLOW, REJECT or ABSTAIN. Self-explanatory. When all filters in a scope abstain, the scope's allowByDefault value kicks in (it is also used when a scope has no filters).

The different scopes in use are: FetchListScope, ParseScope and PostFetchScope.
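
In code, the contract looks roughly like this; the class below is an illustrative sketch rather than the actual OC implementation, and the first-non-abstaining-vote-wins resolution is an assumption on my part.

    import java.util.ArrayList;
    import java.util.List;

    // Illustrative sketch of the scope/filter contract: each filter votes ALLOW,
    // REJECT or ABSTAIN. Here the first non-abstaining vote wins (an assumption);
    // if every filter abstains, or there are no filters, allowByDefault applies.
    public class ScopeSketch {
        public enum Vote { ALLOW, REJECT, ABSTAIN }

        public interface Filter {
            Vote check(String url);
        }

        private final List<Filter> filters = new ArrayList<Filter>();
        private final boolean allowByDefault;

        public ScopeSketch(boolean allowByDefault) {
            this.allowByDefault = allowByDefault;
        }

        public void addFilter(Filter f) {
            filters.add(f);
        }

        public boolean isAllowed(String url) {
            for (Filter f : filters) {
                Vote v = f.check(url);
                if (v == Vote.ALLOW) return true;
                if (v == Vote.REJECT) return false;
            }
            return allowByDefault; // all filters abstained, or no filters
        }
    }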

Our Crawler Todo List

Posted by Kelvin on 25 Aug 2005 | Tagged as: programming, Lucene / Solr / Elasticsearch / Nutch, crawling

In the order in which these are jumping off my brain (read: no order whatsoever):

  • If-modified-since
    OC already has the basic infrastructure in place to implement conditional downloading of pages based on the If-Modified-Since HTTP header. This just needs to be implemented in the Http and HttpResponse classes (see the sketch after this list).
  • Re-use socket connections even for redirects
    Right now, redirects always open new socket connections, even if the redirected URL is on the same host as the original. There's no reason why this should be so.
  • URL duplication is only checked against a fetcherthread's unfetched URLs and its already-fetched URLs, not the URLs currently being fetched. This is a gap that will allow duplicates to sneak in.
  • Customize per-host crawling params
    If host parameters such as HTTP version, response time and number of pages can be captured and stored, the crawler might be able to modify its interaction with the server accordingly, for example by choosing to pipeline fewer requests for slower hosts.
  • Use Commons HttpClient as alternative HTTP implementation
    Although HttpClient doesn't support pipelining, it does provide connection persistence by default. It would be interesting to compare speed differences between this and the pipelined version. Furthermore, HttpClient is much more robust in handling the various nuances web servers tend to throw our way (Andrzej has documented these in the Nutch Jira).
  • Bot-trap detection
    As mentioned in Limitations of OC, bot-traps can be fatal for OC.
  • Checkpointing
    On long crawls, schedule a checkpoint every x minutes so that the crawler can easily be resumed from the last checkpoint in the case of a hang. I don't think this will be difficult to implement, if a bit routine (read: boring).
  • Runtime management (via JMX?)
    The Heritrix project has taken the JMX route for incorporating runtime management into their crawler. I don't know enough about JMX to decide if this is a good idea, but I have heard that JMX is icky. I suppose it's either JMX or client-server? Those seem to be the only two options I have if I want to build a GUI for OC.
  • Implement ExtendibleHashFile and a fixed-length optimized version of SequenceFile/MapFile.
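
For the If-Modified-Since item above, the mechanics are standard HTTP conditional GET. Here's a sketch using java.net.HttpURLConnection purely for illustration; OC's own Http/HttpResponse classes would need to do the equivalent over their pipelined connections.

    import java.net.HttpURLConnection;
    import java.net.URL;

    // Sketch of a conditional download: send If-Modified-Since with the timestamp
    // from the LastModifiedDB, and skip the body on a 304 Not Modified response.
    public class ConditionalFetchSketch {
        // returns true if the page was re-downloaded, false if it was unchanged
        public static boolean fetchIfModified(String url, long lastFetchedMillis) throws Exception {
            HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
            conn.setIfModifiedSince(lastFetchedMillis);
            if (conn.getResponseCode() == HttpURLConnection.HTTP_NOT_MODIFIED) {
                return false; // server says nothing has changed since lastFetchedMillis
            }
            // ... read conn.getInputStream() and hand the content to post-fetch processing
            return true;
        }
    }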

Limitations of OC

Posted by Kelvin on 19 Aug 2005 | Tagged as: work, programming, Lucene / Solr / Elasticsearch / Nutch, crawling

Follow-up post to some Reflections on modifying the Nutch crawler.

The current implementation of Our Crawler (OC) has the following limitations:

  1. No support for distributed crawling.
    I neglected to mention that by building the fetchlist offline, the Nutch Crawler (NC) has an easier job splitting the crawling amongst different crawl servers. Furthermore, because the database of fetched URLs (WebDB) is also modified offline, it's really easy to check if a URL has already been fetched.

    Since both fetchlist building and fetched-URL DB updates happen online in OC, the multiple-crawl-server scenario complicates things somewhat. My justification for omitting this, in the initial phase at least, is that for most constrained crawls a single crawl server (with multiple threads) is the most probable use case, and also most likely sufficient.

  2. No load-balancing of URLs amongst threads.
    Currently URL distribution is excessively simple (url.getHost().hashCode() % numberOfThreads), and the only guarantee the Fetcher makes is that every URL from the same host will go to the same thread (an important guarantee, since each fetcherthread maintains its own fetchlist and database of fetched URLs). This also means it's pointless to use 20 threads to crawl 3 hosts (check out the previous post if you're not clear why).

    However, when some hosts are significantly larger than others, it's highly probable that the number of pages each thread has to fetch will be uneven, resulting in sub-optimal fetch times. This calls for a more sophisticated method of assigning URLs to threads, while still maintaining the thread-host contract.

  3. No bot-trap detection.
    With NC, since the depth of the crawl (determined by the number of iterations the fetcher is run) is limited, bot-traps (intentional or otherwise) aren't a major concern.

    When letting OC loose on a site, however, bot-traps can be a big problem because OC will continue to run as long as there are still items in the fetchlist.

  4. Disk writes in multiple threads may nullify SequenceFile performance gains
    Each thread maintains its own database of fetched URLs (via a MapFile). When using a lot of threads, it's quite likely that multiple threads will be writing to disk at once. To be honest, I don't know enough about this to say whether it's a problem, but depending on hardware and disk utilization, I think it's possible that the disk heads may end up jumping around to satisfy the simultaneous writes. If so, SequenceFile's premise of fast sequential write access no longer holds.

These are what I can think of right now. I'll update this list as I go along.
