Supermind Search Consulting Blog 
Solr - Elasticsearch - Big Data

Posts about programming

Ant, subant and basedir troubles

Posted by Kelvin on 11 Nov 2009 | Tagged as: programming

I just finished setting up a multi-project Ant build system and thought I'd blog about it.

My build requirements were exactly what http://www.exubero.com/ant/dependencies.html described, so I followed the recommendations pretty much to the letter.

However, it bombed in one area. One of my projects had a sub-folder where there were sub-projects and I called their respective ant files using subant. However, the sub-projects' basedir was always set to the super project's basedir and the build never completed properly.

Turns out there's a problem mixing ant and subant (in Ant 1.7.1).

Workaround was simple but finding the problem took me 4 hours. Instead of something like this in dependencies.xml,

<target name="depend.model" depends="depend.utilities">
<ant dir="${dependencies.basedir}/model" inheritAll="false"/>
</target>

do this instead

<target name="depend.model" depends="depend.utilities">
<subant inheritAll="false">
<fileset dir="${dependencies.basedir}/model" includes="build.xml"/>
</subant>
</target>

The subant task is a superset of the ant task so it should be a dropin replacement.

Average length of a URL

Posted by Kelvin on 06 Nov 2009 | Tagged as: programming, Lucene / Solr / Elasticsearch / Nutch, crawling

Aug 16 update: I ran a more comprehensive analysis with a more complete dataset. Find out the new figures for the average length of a URL

I've always been curious what the average length of a URL is, mostly when approximating memory requirements of storing URLs in RAM.

Well, I did a dump of the DMOZ URLs, sorted and uniq-ed the list of URLs.

Ended up with 4074300 unique URLs weighing in at 139406406 bytes, which approximates to 34 characters per URL.

TokyoCabinet HDB slowdown

Posted by Kelvin on 10 Oct 2009 | Tagged as: work, programming

http://www.dmo.ca/blog/benchmarking-hash-databases-on-large-data/ reported that with a large number of records, puts become increasingly slower.

I experienced a similar phenomenon, and just stumbled upon http://parand.com/say/index.php/2009/04/09/tokyo-cabinet-observations/ , where I realized my problem was with bnum being too small (default of 128k).

According to docs, bnum is

number of elements of the bucket array. If it is not more than 0, the default value is specified. The default value is 131071 (128K). Suggested size of the bucket array is about from 0.5 to 4 times of the number of all records to be stored.

So, when you're dealing with a large number of records with the Tokyo Cabinet HDB, don't forget to increase the size of bnum accordingly.

2GB limit with Tokyo Cabinet (aka invalid meta data)

Posted by Kelvin on 28 Sep 2009 | Tagged as: programming

Ran into a bunch of frustrating problems with Tokyo Cabinet recently.

When my hash database was reaching 2GB in size, the datafile would become corrupt. What's scary is that at that stage, writes to the database seemingly disappeared into the void.

When reopening this corrupt DB, I'd get an "invalid meta data" exception. I googled for "tokyo cabinet invalid meta data" and didn't turn up anything useful. There are, of course, no forums to ask for help.

After re-reading javadocs, I noticed the TLARGE large database flag. Good.. enabled it and reran the data import. Only to have it crash again.

Turns out you ALSO need to _compile_ TC (NOT TC-java bindings) with the --enable-off64 flag enabled during the configure stage. Like so

./configure --enable-off64
make && sudo make install

If you had already compiled it previously, you also need to make clean first.

LightVC – a simple and elegant PHP framework

Posted by Kelvin on 28 Sep 2009 | Tagged as: programming, PHP

Whilst working on a recent project involving clinical trials, I stumbled on LightVC, a php framework. Yes.. yet ANOTHER php framework.

Its emphasis on simplicity and minimalism caught my eye and I decided to give it a whirl.

3 months later, I have to admin I'm a total fan. It makes the simple stuff easy, and the tough stuff.. well.. possible. It is a pure view-controller framework w/o ORM. Perfect because my backend is usually handled by Solr anyway.

Highly recommended if you're not already invested in Zend or one of the biggies (cakephp, symfony, etc)

Spatial searching multiple locations per Solr/Lucene doc

Posted by Kelvin on 09 Sep 2009 | Tagged as: programming

There are a number of solutions for geosearching/spatial search in Lucene and Solr.

LocalLucene and LocalSolr are excellent options.
LuceneTutorial.com describes a partial solution for doing it the old-school way using FieldCache and a custom lucene query.
– The new TrieRange feature introduced by Uwe Schindler also offers a new way of performing range searches on numbers.

Here's the twist though: all of the solutions mentioned above work only when there is a single lat/long pair _per_ document.

When would we need to have multiple lat/longs per document?

An example would be an event search engine where a single event can have multiple locations and you'd like all locations to turn up when searching within a zipcode, but for them to appear under their respective events.

Here's an algorithm I came up with that works well enough for smallish indexes:

Given that there are "container" documents with multiple locations,

1. index container and location as separate document types
2. assign uid fields to container and location docs (standard in solr)
3. location docs have a lat/long field (indexed as a string), and a "reference" to the container id value
4. load all location docs as Point2D.Float into memory
5. when a geo search request comes, convert to lat/long, then produce a bounding rectangle encapsulating the desired radius
6. iterate through the set of Point2D.Floats, saving the points within the bounds
7. obtain the list of container ids these points contain
8. construct a query filter out of these container ids
9. finally, perform the search on container docs with the query filter

For bonus points in Solr, you can easily add the requisite location doc ids into the solr response so you can reference exactly which locations were matched to which container.

Its long and somehow feels abit hackish, but it works, and blazing fast coz its all in memory.

Idea: 2-stage recovery of corrupt Solr/Lucene indexes

Posted by Kelvin on 09 Sep 2009 | Tagged as: programming, Lucene / Solr / Elasticsearch / Nutch

I was recently onsite with a client who happened to have a corrupt Solr/Lucene index. The CheckIndex tool (lucene 2.4+) diagnosed the problem, and gave the option of fixing it.

Except… fixing the index in this case meant losing the corrupt segment, which also happened to be the one containing over 90% of documents.

Because Solr has the concept of a doc uid (which Lucene doesn't have), what I did was write a tool for them to dump out the uids in that corrupted segment into a text file, so after recovering the index, they were able to reindex the docs that were lost in that segment.

Benchmarks for various approaches to serializing java objects

Posted by Kelvin on 10 Jul 2009 | Tagged as: programming

http://www.eishay.com/2009/03/more-on-benchmarking-java-serialization.html

Two simple optimizations to DP algorithm for calculating Levenstein edit distance

Posted by Kelvin on 01 Jul 2009 | Tagged as: programming

Levenstein/edit distance is most often calculated using a dynamic programming (DP) algorithm.

The algorithm goes like this:

1. given 2 strings, s and t
2. instantiate d, an m x n matrix where m = length of s + 1 and n = length of t + 1
3. for each char in s
4. for each char in t
5. set d[i][j] = minimum ( d[i][j-1] + 1, d[i – 1][j] + 1, d[i – 1][j – 1] + cost), where cost = 0 if s.charAt(i) == t.charAt[j] and 1 otherwise

The edit distance is the value of d[length of s][length of t]

http://www.mgilleland.com/ld/ldjavascript.htm has a fantastic visualizer for the algorithm in javascript.

This is the base case for the operations: insert, delete and replace/substitute.

To support adjacent transpositions, we need an additional

int trans	= d[ 0 ][ i - 2 ] + 1;
if ( s.charAt( i - 2 ) != t ) trans++;
if ( s != t.charAt( j - 2 ) ) trans++;
if ( d[i][j] > trans ) d[i][j] = trans;

2 simple optimizations to this algorithm:

1. Instead of storing m x n matrix (which would result in OutOfMemoryErrors for large strings), you can store only the last row of calculations (or the last 2 rows when using transpositions).

2. Apply Ukkonen's cutoff algorithm which reduces table evaluations such that not every row of the matrix need be populated. This algorithm evaluates O(k^2) entries where k = max errors.

Ukkonen's algorithm was published in E. Ukkonen. Finding approximate patterns in strings. Journal of Algo-
rithms, 6:132{7, 1985.
, but I couldn't find a version of it freely available online. It is, however,described in Tries for Approximate Matching.

The general idea is to start evaluating a row at the previous column's largest value. Read the paper for more details.

Trie-based approximate autocomplete implementation with support for ranks and synonyms

Posted by Kelvin on 01 Jul 2009 | Tagged as: programming

The problem of auto-completing user queries is a well-explored one.

For example,
Type less, find more: fast autocompletion search with a succinct index
http://stevedaskam.wordpress.com/2009/06/07/putting-autocomplete-data-structure-to-the-test/
http://suggesttree.sourceforge.net/
http://sujitpal.blogspot.com/2007/02/three-autocomplete-implementations.html

However, there's been little written about supporting synonyms and approximate matching for the purpose of autocompletion.

The approach for autocompletion I'll be discussing in this article supports the following features:

– given a prefix, returns all words starting with that prefix
– approximate/fuzzy prefix matching based on k-errors / edit distance / Levenstein distance, with the operations: substitution, insert and delete (though adjacent transpositions would be trivial to add)
– support for ranks (i.e. a number associated with a word that affects ranking, like number of search results for a given phrase)
– support for synonyms

The search algorithm is independent of dictionary size, and dependent on k (edit distance) and length of prefix.

Data Structures

The data structure used is a trie. Words are broken into characters. Each character forms a node in a tree.

To support synonyms, a pointer is added to terminal nodes which points to the canonical word of the synonym ring.

To support ranks, a pointer is added to terminal nodes with the value of the rank. At search time, nodes are sorted first by edit distance, then by rank.

Implementation

At index-time, as mentioned above, a trie is built from input words/lexicon. No other preprocessing is done.

At search-time, dynamic programming (DP) is applied on a depth-first traversal of the trie, to collect all the "active sub-tries" of the trie which are less than k errors from the given prefix.

Where traditional DP uses a m x n matrix for its DP table where m and n are the 2 strings to be compared, we instantiate a single (prefix length + max k) x (prefix length) matrix for the entire trie, where prefix is the string for which we want to produce autocomplete results for.

Why (prefix length + max k) x (prefix length) ?

1. We don't need to compare full strings because we're only interested in collecting active sub-tries which satisfy the k-errors constraint.

2. For cases where the length of the word to be evaluated is greater than the length of the prefix, evaluations of the word should be performed up to prefix length + maximum k. This is to take into account the scenario where the first k characters of a word, when deleted, satisfy the edit distance constraint. e.g. prefix of hamm and word bahamm, with k=2.

Cutoff optimizations

Each level of the trie has a non-decreasing value of k (edit distance).

Therefore when k > maximum k, we can proceed to reject the entire sub-trie of that node. In other words, given the prefix hamm, and maximum k of 1, when we encounter hen which has k=2, we can discard any children of hen because their k-values will be >= 2.

Collecting autocomplete results

With a list of active sub-tries, we will then proceed to collect all terminal strings, sort them by edit distance and rank, and return the top n results.

Further explorations

1. Implementing adjacent transpositions
2. Implementing Ukkonen's cutoff algorithm for DP (not to be confused with Ukkonen's algorithm for constant-time creation of suffix trees)
3. Comparing performance of tries vs compact tries (where non-branching nodes are collapsed)

« Previous PageNext Page »