Supermind Search Consulting Blog 
Solr - Elasticsearch - Big Data

Posts about Lucene / Solr / Elasticsearch / Nutch

PHP + Lucene integration

Posted by Kelvin on 01 Jan 2007 | Tagged as: programming, Lucene / Solr / Elasticsearch / Nutch

I've had very positive experiences integrating PHP front-end with a Lucene back-end, not with any fancy Java-in-PHP wizardry or Zend Lucene, but plain-old JSON-over-REST.

I've done some simple load tests, and it clearly outperforms Zend Lucene (though I can't offer you any numbers to back this claim up). Zend Lucene seems to suck badly at large (1GB+) indexes.

In terms of implementation, a servlet takes a query parameter, performs the search, then returns the results as JSON objects. There is a well-defined API, and the solution offers the full expressiveness of Lucene's query mechanism.

With a little creativity, you can even perform stuff like facet counting (see the left sidebar).

By abstracting the search presentation layer (JSON in this case), it would be a matter of minutes to add an XML or HTML display on top of it.
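
For illustration, the JSON-building half of such a servlet can be sketched like this (plain Java with hand-rolled serialization; class and method names are my own invention, not from the actual implementation):

```java
import java.util.List;

// Hypothetical sketch of the servlet's response-building step: turn search
// hits into a JSON array the PHP front-end can json_decode(). Names are
// illustrative, not from the original servlet.
public class JsonResults {

    // Escape the characters that would break a JSON string literal.
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    // Render (title, score) pairs as a JSON array of objects.
    public static String toJson(List<String> titles, List<Float> scores) {
        StringBuilder sb = new StringBuilder("[");
        for (int i = 0; i < titles.size(); i++) {
            if (i > 0) sb.append(",");
            sb.append("{\"title\":\"").append(escape(titles.get(i)))
              .append("\",\"score\":").append(scores.get(i)).append("}");
        }
        return sb.append("]").toString();
    }

    public static void main(String[] args) {
        System.out.println(toJson(List.of("Hello \"World\""), List.of(0.5f)));
    }
}
```

The PHP side then only needs `json_decode()` on the response body, which is what keeps the front-end completely decoupled from Lucene.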

A simple API-friendly crawler

Posted by Kelvin on 01 Dec 2006 | Tagged as: programming, Lucene / Solr / Elasticsearch / Nutch, crawling

Alright. I know I've blogged about this before. Well, I'm revisiting it again.

My sense is that there's a real need for a simple crawler which is easy to use as an API and doesn't attempt to be everything to everyone.

Yes, Nutch is cool, but I'm so tired of fiddling around with configuration files, the proprietary file formats, and the filesystem-dependence of plugins. Also, crawl progress reporting is poor unless you're prepared to parse log files.

Here are some thoughts on what a simple crawler might look like:

Download all pages in a site


    SimpleCrawler c = new SimpleCrawler();
    c.addURL(url);
    c.setOutput(new SaveToDisk(downloaddir));
    c.setProgressListener(new StdOutProgressListener());
    c.setScope(new HostScope(url));
    c.start();

Download all urls from a file (depth 1 crawl)


    SimpleCrawler c = new SimpleCrawler();
    c.setMaxConnectionsPerHost(5);
    c.setIntervalBetweenConsecutiveRequests(1000);
    c.addURLs(new File(file));
    c.setLinkExtractor(null);
    c.setOutput(new DirectoryPerDomain(downloaddir));
    c.setProgressListener(new StdOutProgressListener());
    c.start();

Page through a search results page via regex


    SimpleCrawler c = new SimpleCrawler();
    c.addURL(url);
    c.setLinkExtractor(new RegexLinkExtractor(regex));
    c.setOutput(new SaveToDisk(downloaddir));
    c.setProgressListener(new StdOutProgressListener());
    c.start();

Save to nutch segment for compatibility


    SimpleCrawler c = new SimpleCrawler();
    c.addURL(url);
    c.setOutput(new NutchSegmentOutput(segmentdir));
    c.setProgressListener(new StdOutProgressListener());
    c.start();

I'm basically trying to find the sweet-spot between Commons HttpClient, and a full-blown crawler app like Nutch.

Thoughts?

Search and crawling internship

Posted by Kelvin on 31 Oct 2006 | Tagged as: programming, Lucene / Solr / Elasticsearch / Nutch, crawling

I'm looking for a competent Java programmer who wants to get into the world of search engines and crawlers. The internship will involve a mixture of (some) training and (mostly) hands-on projects with real-world clients.

In particular, my area of expertise is in vertical search (real estate, jobs, classifieds), so more than likely, that will be the bulk of the exposure.

You'll be working mostly with Lucene and Nutch, but more than just using them, you'll be delving into their innards and pushing the limits of these libraries.

Upon graduation, if you're interested, you'll be invited to stay on and work alongside me on contracting projects, some of which might involve on-site work, perhaps overseas.

The background to this is that I'm basically tired of turning down consulting work because of my limited time, and want to start building a search/crawling consulting practice.

So if you're interested, send your resume along to internship AT supermind.org, with a brief note on your background and interest in this field.

Nutch 0.8, Map & Reduce, here I come!

Posted by Kelvin on 09 Aug 2006 | Tagged as: crawling, programming, Lucene / Solr / Elasticsearch / Nutch

Finally taking the plunge to Nutch 0.8 after exclusively working with 0.7 for over a year (and something like 5 projects).

From initial experiences, it appears that using M&R does obfuscate the code somewhat for a developer who wants to build an app off the Nutch infrastructure instead of using it out-of-box. For example, trying to decipher what's going on in org.apache.nutch.indexer.Indexer is pretty difficult, compared to its 0.7 counterpart (IndexSegment).

Some serious documentation needs to be done around the implementation details of M&R. I'll keep posting about my learnings..

Lucene scoring for dummies

Posted by Kelvin on 08 Mar 2006 | Tagged as: Lucene / Solr / Elasticsearch / Nutch, crawling

The factors involved in Lucene's scoring algorithm are as follows:

1. tf = term frequency in document = measure of how often a term appears in the document
2. idf = inverse document frequency = measure of how often the term appears across the index
3. coord = number of terms in the query that were found in the document
4. lengthNorm = measure of the importance of a term according to the total number of terms in the field
5. queryNorm = normalization factor so that queries can be compared
6. boost (index) = boost of the field at index-time
7. boost (query) = boost of the field at query-time

The implementation, implication and rationale of factors 1, 2, 3 and 4 in DefaultSimilarity.java, which is what you get if you don't explicitly specify a Similarity, are:

note: the implication of these factors should be read as, "Everything else being equal, … "

1. tf
Implementation: sqrt(freq)
Implication: the more frequent a term occurs in a document, the greater its score
Rationale: documents which contain more of a term are generally more relevant

2. idf
Implementation: log(numDocs/(docFreq+1)) + 1
Implication: the greater the occurrence of a term in different documents, the lower its score
Rationale: common terms are less important than uncommon ones

3. coord
Implementation: overlap / maxOverlap
Implication: of the terms in the query, a document that contains more terms will have a higher score
Rationale: self-explanatory

4. lengthNorm
Implementation: 1/sqrt(numTerms)
Implication: a term matched in a field with fewer terms has a higher score
Rationale: a term in a field with fewer terms is more important than one in a field with more

queryNorm is not related to the relevance of the document, but rather tries to make scores between different queries comparable. It is implemented as 1/sqrt(sumOfSquaredWeights)
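
Plugging some numbers into these formulas makes their shape concrete. Here they are reproduced as plain methods (a sketch for experimentation, not Lucene's actual classes):

```java
// The DefaultSimilarity factor implementations from the list above,
// reproduced as standalone methods so the numbers are easy to play with.
public class ScoringFactors {

    // tf: sqrt(freq)
    static double tf(int freq) { return Math.sqrt(freq); }

    // idf: log(numDocs/(docFreq+1)) + 1
    static double idf(int docFreq, int numDocs) {
        return Math.log(numDocs / (double) (docFreq + 1)) + 1;
    }

    // lengthNorm: 1/sqrt(numTerms)
    static double lengthNorm(int numTerms) { return 1 / Math.sqrt(numTerms); }

    public static void main(String[] args) {
        // A term appearing 4 times scores twice as much as one appearing once.
        System.out.println(tf(4));          // 2.0
        // A term in 9 of 1000 docs is rarer, so it scores higher...
        System.out.println(idf(9, 1000));
        // ...than a term in 99 of 1000 docs.
        System.out.println(idf(99, 1000));
        // A 4-term field weighs each match at 0.5.
        System.out.println(lengthNorm(4));  // 0.5
    }
}
```

Note the square roots and logarithms throughout: each factor grows (or shrinks) sub-linearly, so no single factor can dominate the score.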

So, roughly speaking (quoting Mark Harwood from the mailing list),
* Documents containing *all* the search terms are good
* Matches on rare words are better than for common words
* Long documents are not as good as short ones
* Documents which mention the search terms many times are good

The mathematical definition of the scoring can be found at http://lucene.apache.org/java/docs/api/org/apache/lucene/search/Similarity.html

Hint: look at NutchSimilarity in Nutch to see an example of how web pages can be scored for relevance

Customizing scoring

It's easy to customize the scoring algorithm: just subclass DefaultSimilarity and override the method you want to customize.

For example, if you want to ignore how commonly a term appears across the index:


Similarity sim = new DefaultSimilarity() {
  public float idf(int docFreq, int numDocs) {
    return 1;
  }
};

and if you think that, for the title field, more terms are better:


Similarity sim = new DefaultSimilarity() {
  public float lengthNorm(String fieldName, int numTerms) {
    if (fieldName.equals("title")) return (float) (0.1 * Math.log(numTerms));
    else return super.lengthNorm(fieldName, numTerms);
  }
};

OC and focused crawling

Posted by Kelvin on 26 Feb 2006 | Tagged as: Lucene / Solr / Elasticsearch / Nutch, crawling

I've had the good fortune to get paid to work on OC (Our Crawler). Features I've been developing have been for focused crawling purposes.

Specifically:

  1. Ranking content by relevance to a supplied query and crawling the most relevant links first, with the possibility of specifying a score threshold
  2. Checkpointing the crawl output (which is a Nutch segment) at time intervals, e.g. every 60 minutes. This is insurance against hung crawls, or if the crawler hit a bot-trap and couldn't exit.
  3. Time-limited "perpetual crawling", where the crawler keeps going until a time limit is reached, at which point it stops all threads and exits gracefully.
  4. Introducing various fetchlist filters which reduce the chances of getting lost in bot-traps, such as don't go further than x levels deep within a host, and reject URLs which repeatedly increase the number of query parameters.
  5. MySQL and BDB-backed persistence.

In addition, some refactoring has also taken place that makes it easier to run crawls via API (as opposed to command-line or Spring). The role of Spring has also been relegated from obligatory to optional (but sweet to have, all the same).

We're still discussing the details of whether all of the code can be open-sourced, though. I'm keeping my fingers crossed.

Next on the plate is support for distributed crawling. Will OC use Nutch's Map-Reduce? That remains to be seen…

The next few months for OC

Posted by Kelvin on 28 Jan 2006 | Tagged as: Lucene / Solr / Elasticsearch / Nutch, crawling

I had a chat with Mike from Atlassian recently, and have arrived at the conclusion that the future of OC lies in being a crawler API, much like what Lucene does for searching. I suppose it will lie somewhere between Nutch (full-blown whole-web crawler) and Commons HTTPClient.

Some directions I will explore include:

  • Introducing checkpointing to recover from hung/crashed crawls
  • A GUI app (probably thinlet-based) for monitoring crawl status
  • Authentication databases for sites (username/password or cookies)
  • Alternatives to Nutch's SegmentFile

I expect to have some free time on my hands to resume work on OC in the coming months.

Update 270106:
Well, checkpointing at the segment level is done at least. So I can't yet recover from a failed crawl, but at least I don't lose everything. 🙂 It's a timer-based checkpointer: it closes the segment writers every 60 minutes and opens new ones for writing.

Database mining would be very cool, with support for paging and so on, though it feels somewhat peripheral to OC. If we include some kind of regex/parsing framework for screen-scraping, we would already have about 85% of what we need to build a vertical search portal.

There is already an alternative to SegmentFile for online DB updates, and it's BerkeleyDB. A simple BerkeleyDB persister has already been written.. but but but.. I don't like that it's GPL (albeit a looser version of GPL). So, one day when I'm feeling particularly inspired, I'll hack up my own BTree and external HashMap implementation.

Now, a GUI for crawling would be totally sweet. In fact, that's probably what I'm going to work on next.

Crawling Basics

Posted by Kelvin on 27 Oct 2005 | Tagged as: work, Lucene / Solr / Elasticsearch / Nutch, crawling

Here's my attempt at providing an introduction to crawling, in the hope that it'll clarify some of my own thoughts on the topic.

The most basic tasks of a HTTP crawler are:

  1. Download urls via the HTTP protocol.
  2. Extract urls from downloaded pages to be added to the download queue.

That's actually very simple, and crawlers are very simple applications at heart. 2 factors complicate crawler development:

  • they deal with large datasets, and require care when designing the appropriate data structures and algorithms for acceptable performance.
  • they require either multi-threaded or non-blocking IO implementations, and both are non-trivial to do well.
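
Stripped of those complications, the two basic tasks reduce to a single loop. Here's a minimal sketch (all names are mine, and the "download" step is stubbed with an in-memory map so the control flow stands alone):

```java
import java.util.*;
import java.util.regex.*;

// A skeleton of the basic crawl loop described above: take a URL off the
// queue, download it, extract links, enqueue the ones we haven't seen.
// The visited-set is what keeps the crawl from looping forever.
public class CrawlLoop {

    static final Pattern HREF = Pattern.compile("href=\"(http[^\"]+)\"");

    // Task 2: extract urls from a downloaded page.
    public static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) links.add(m.group(1));
        return links;
    }

    // The crawl loop; 'web' stands in for the HTTP download (task 1).
    public static List<String> crawl(String seed, Map<String, String> web) {
        Deque<String> queue = new ArrayDeque<>();
        Set<String> visited = new LinkedHashSet<>();
        queue.add(seed);
        while (!queue.isEmpty()) {
            String url = queue.poll();
            if (!visited.add(url)) continue;          // already downloaded
            String page = web.get(url);               // task 1: "download"
            if (page == null) continue;
            for (String link : extractLinks(page)) {  // task 2: extract
                if (!visited.contains(link)) queue.add(link);
            }
        }
        return new ArrayList<>(visited);
    }

    public static void main(String[] args) {
        Map<String, String> web = Map.of(
            "http://a", "<a href=\"http://b\">b</a>",
            "http://b", "<a href=\"http://a\">back</a>");
        System.out.println(crawl("http://a", web)); // terminates despite the cycle
    }
}
```

Everything a real crawler adds (threading, politeness, scoping) hangs off this loop: the queue becomes a prioritized fetchlist, and the visited-set becomes a persistent database of downloaded urls.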

To be a bit more intelligent/useful, crawlers usually also provide some way of

  1. restricting the scope of the crawl
  2. prioritizing the download queue
  3. storing the downloaded pages for easy retrieval, preferably in compressed form
  4. logging and status reporting
  5. normalizing the urls
  6. being compliant with HTTP specifications, e.g. using robots.txt
  7. being polite to servers
  8. recrawling urls

Let's go through the interesting ones..

  • restricting the scope of the crawl: generally accomplished via 2 methods – duplicate url detection (removing urls that have already been downloaded or are already in the queue) and user-defined crawl scopes. The 1st is performed automatically by the crawler, and the 2nd by the user. Ways in which the user can limit the crawl scope include: limit urls to same hosts as seed urls, limit by depth from seed, regex, etc.

    A note about restricting scope of crawl by content for whole-web focused crawlers: Content-based sites tend to cluster (i.e. link to each other) and crawling an entire cluster is easy. However, clusters not linked from seed urls are difficult to get to, and moving up a cluster might be difficult, i.e. given 2 clusters A and B and 2 link hubs A1 and B1, A1 and B1 might share a link, but the common site may not link to both or either of them.

  • prioritizing the download queue: important for both whole-web and focused crawlers, but for different reasons.

    Whole-web crawlers
    The last thing a whole-web crawler wants to do is crawl pages indiscriminately. It generally wants to crawl pages that are both authoritative and have a high incoming and outgoing link count (hubs) as early as possible. It also needs to distribute requests to be polite to its servers.

    Focused crawlers
    Focused crawlers can be further broken down into

    • whole-web focused crawlers which are interested in a subset of the whole-web, usually partitioned by language or content. Queue prioritization would be similar to whole-web.
    • Site mirroring focused crawlers (SMFC) which are only interested in their seed urls (and possibly all pages within that domain/host). SMFCs can radically decrease overall crawl time by being smart about prioritizing their urls, especially when the number of hosts to crawl is small (<100). Btw, I suspect that the majority of developers who use Nutch actually belong to this category.

  • url normalization: ensures that urls that would be filtered out don't sneak through. Includes remembering redirects encountered, for example.
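
To make the normalization step concrete, here's a hypothetical normalizer covering the usual canonicalizations (lowercased scheme and host, dropped default port and fragment). A real crawler would layer redirect tracking on top of this:

```java
import java.net.URI;
import java.net.URISyntaxException;

// A sketch of a url normalizer: canonicalize the parts of a URL that vary
// without changing the resource, so duplicate detection actually works.
public class UrlNormalizer {

    public static String normalize(String url) {
        try {
            URI u = new URI(url);
            String scheme = u.getScheme() == null ? null : u.getScheme().toLowerCase();
            String host = u.getHost() == null ? null : u.getHost().toLowerCase();
            int port = u.getPort() == 80 ? -1 : u.getPort();   // drop default HTTP port
            String path = (u.getPath() == null || u.getPath().isEmpty()) ? "/" : u.getPath();
            // Rebuild without the fragment: #section is invisible to the server.
            return new URI(scheme, null, host, port, path, u.getQuery(), null).toString();
        } catch (URISyntaxException e) {
            return url;   // leave unparseable urls untouched
        }
    }

    public static void main(String[] args) {
        System.out.println(normalize("HTTP://Example.COM:80/a#top")); // http://example.com/a
    }
}
```

Without this step, `http://example.com`, `http://example.com/` and `http://EXAMPLE.com:80/` would all be fetched as distinct urls.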

More advanced crawlers also

  • checkpoint at regular intervals so interrupted crawls can be recovered
  • detect and avoid bot traps
  • throttle bandwidth usage according to bandwidth availability and server response time
  • allow crawling speed to be scaled (linearly) by adding more crawl boxes

Ignoring the advanced features, the high-level modules are:

  • HTTP request and response module
  • Link extractor (can be regex or html parser)
  • download queue (we'll call this the fetchlist from now on), and which urls are currently being downloaded
  • database of downloaded urls
  • database of downloaded pages
  • link graph/database
  • url normalizer

That's all for now. In a follow-up post, I'll walkthrough the typical application flow of a crawler, and the interactions between the modules/data structures.

Practical introduction to Nutch MapReduce

Posted by Kelvin on 28 Sep 2005 | Tagged as: work, Lucene / Solr / Elasticsearch / Nutch, crawling

Some terminology first:

Mapper
Performs the map() function. The name will make sense when you look at it as "mapping" a function/operation to elements in a list.
Reducer
Performs the reduce() function. Merges multiple input values with the same key to produce a single output value.
OutputFormat/InputFormat
Classes which tell Nutch how to process input files and what output format the MapReduce job is to produce.

Currently implemented classes: MapFile (output only), Text and SequenceFile. What this basically means is that out-of-box, Nutch supports SequenceFiles and text files as input formats for MapReduce jobs.

Partitioner
Splits up input into n different partitions, where n is the number of map tasks.
Combiner
Javadoc says: Implements partial value reduction during mapping. There is no special interface for a Combiner; rather, it is a Reducer which is configured to be used during the mapping phase.

To complete a MapReduce job successfully, Nutch requires at least

  • a directory containing input files (in the format determined by JobConf.setInputFormat, which defaults to newline-terminated text files)
  • a mapper or reducer (even though technically the job will still complete even if none is provided, as the HelloWorld example shows)
  • output format

To map or to reduce?
Rule of thumb: when performing operation on one input key/value, use Mapper; when requiring multiple input, or merging files, use Reducer.

Technically, the implementation of MapReduce in Nutch allows for the map function to be performed by only the Reducer, although this would seem to be a rather inappropriate use of a Reducer.
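
To make the semantics concrete without dragging in the Nutch classes, here's the map/group/reduce dataflow in plain Java, using word counting as the example. This mirrors the roles of Mapper and Reducer described above; it is not the Nutch API itself:

```java
import java.util.*;

// Plain-Java illustration of the MapReduce dataflow: map each input record
// to (key, value) pairs, group pairs by key, then reduce each group to a
// single output value.
public class WordCount {

    public static Map<String, Integer> run(List<String> lines) {
        // Map phase: one (word, 1) pair per token.
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines)
            for (String word : line.split("\\s+"))
                pairs.add(Map.entry(word, 1));

        // Shuffle: group values by key (the framework does this between phases).
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs)
            groups.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());

        // Reduce phase: merge each key's values into a single output value.
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> g : groups.entrySet())
            counts.put(g.getKey(), g.getValue().stream().mapToInt(Integer::intValue).sum());
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("hello world", "hello nutch")));
        // {hello=2, nutch=1, world=1}
    }
}
```

The map and reduce steps here are exactly where a Mapper and Reducer plug in; the grouping in the middle is what the framework's Partitioner and sort machinery take care of for you.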

Hello World for MapReduce

Posted by Kelvin on 28 Sep 2005 | Tagged as: work, Lucene / Solr / Elasticsearch / Nutch, crawling

Here's a Hello World tutorial as part of my attempts to grok MapReduce.

HelloWorld.java :

import org.apache.nutch.mapred.JobClient;
import org.apache.nutch.mapred.JobConf;
import org.apache.nutch.util.NutchConf;
import java.io.File;

public class HelloWorld {

  public static void main(String[] args) throws Exception {
    if (args.length < 1) {
      System.out.println("Usage: HelloWorld <inputDir>");
      System.exit(-1);
    }

    NutchConf defaults = NutchConf.get();
    JobConf job = new JobConf(defaults);
    job.setInputDir(new File(args[0]));
    job.setOutputFormat(ConsoleOutputFormat.class);
    JobClient.runJob(job);
  }
}

and ConsoleOutputFormat.java (for printing to System.out):

import org.apache.nutch.fs.NutchFileSystem;
import org.apache.nutch.io.Writable;
import org.apache.nutch.io.WritableComparable;
import org.apache.nutch.mapred.JobConf;
import org.apache.nutch.mapred.OutputFormat;
import org.apache.nutch.mapred.RecordWriter;
import org.apache.nutch.mapred.Reporter;

import java.io.IOException;

public class ConsoleOutputFormat implements OutputFormat {
  public RecordWriter getRecordWriter(NutchFileSystem fs, JobConf job, String name) throws IOException {
    return new RecordWriter() {
      public void write(WritableComparable key, Writable value) {
        System.out.println(value);
      }

      public void close(Reporter reporter) {

      }
    };
  }
}

And now create a new directory someplace, and create a new file (say, foo.txt). Fire up your text editor, and in foo.txt, enter:

Hello World

Important: There MUST be a newline at the end of the file, following the words "Hello World". It is also important (for the purposes of this tutorial), that a new directory be created and there is only one file in this directory.

Now, run the HelloWorld application, providing the location of the directory where foo.txt resides, for example

java HelloWorld /tmp/mapreduce/

Note: It's best to run the application directly from your IDE, so you won't have to worry about adding the necessary libs to the classpath.

After some output from Nutch about parsing the config files, you should see the text Hello World.

Congratulations! You've run your first MapReduce program!
