Custom Solr QueryParsers for fun and profit
Posted by Kelvin on 09 Sep 2013 | Tagged as: Lucene / Solr / Elasticsearch / Nutch
In this post, I'll show you what you need to do to implement a custom Solr QueryParser.
Step 1
Extend QParserPlugin.
public class TestQueryParserPlugin extends QParserPlugin {
  public void init(NamedList namedList) {
  }

  @Override
  public QParser createParser(String s, SolrParams localParams, SolrParams params, SolrQueryRequest req) {
    return new TestQParser(s, localParams, params, req);
  }
}
This is the class you'll define in solrconfig.xml, informing Solr of your queryparser. Define it like so:
<queryParser name="myfunparser" class="org.supermind.solr.queryparser.TestQParserPlugin"/>
Step 2
Extend QParser.
public class TestQParser extends QParser {
  public TestQParser(String qstr, SolrParams localParams, SolrParams params, SolrQueryRequest req) {
    super(qstr, localParams, params, req);
  }

  @Override
  public Query parse() throws SyntaxError {
    return null;
  }
}
Step 3
Actually implement the parsing in the parse() method.
Suppose we wanted to make a really simple parser for term queries, which are space-delimited. Here's how I'd do it:
@Override
public Query parse() throws SyntaxError {
  String defaultField = req.getSchema().getDefaultSearchFieldName();
  QueryParser.Operator defaultOperator = QueryParser.Operator.valueOf(req.getSchema().getQueryParserDefaultOperator());
  BooleanClause.Occur op = (defaultOperator == QueryParser.Operator.AND) ? BooleanClause.Occur.MUST : BooleanClause.Occur.SHOULD;
  String[] arr = qstr.split(" ");
  BooleanQuery bq = new BooleanQuery(true);
  for (String s : arr) {
    if (s.trim().length() == 0) continue;
    bq.add(new TermQuery(new Term(defaultField, s)), op);
  }
  return bq;
}
Step 4
In your query, use the nested query syntax to call your queryparser, e.g.
http://localhost:8983/solr/collection1/select?q={!myfunparser}foo+bar+car
Maybe in a follow-up post, I'll post the full code with jars and all.
High-level overview of Latent Semantic Analysis / LSA
Posted by Kelvin on 09 Sep 2013 | Tagged as: programming, Lucene / Solr / Elasticsearch / Nutch
I've just spent the last couple days wrapping my head around implementing Latent Semantic Analysis, and after wading through a number of research papers and quite a bit of linear algebra, I've finally emerged on the other end, and thought I'd write something about it to lock the knowledge in. I'll do my best to keep it non-technical, yet accurate.
Step One – Build the term-document matrix
Input : documents
Output : term-document matrix
Latent Semantic Analysis has the same starting point as most Information Retrieval algorithms : the term-document matrix. Specifically, columns are documents, and rows are terms. If a document contains a term, then the value of that row-column is 1, otherwise 0.
If you start with a corpus of documents, or a database table or something, then you'll need to index this corpus into this matrix. Meaning, lowercasing, removing stopwords, maybe stemming etc. The typical Lucene/Solr analyzer chain, basically.
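To make this concrete, here's a minimal sketch of building a binary term-document matrix in plain Java. It assumes the documents have already been run through the analyzer chain (lowercased, stopwords removed, etc.) and are just lists of terms; the class and variable names are mine, purely for illustration.

import java.util.List;
import java.util.Map;

public class TermDocumentMatrix {
  // Builds a binary term-document matrix: rows are terms, columns are documents.
  // termIndex is filled in as a side effect, mapping each distinct term to its row.
  public static double[][] build(List<List<String>> docs, Map<String, Integer> termIndex) {
    for (List<String> doc : docs) {
      for (String term : doc) {
        termIndex.putIfAbsent(term, termIndex.size());
      }
    }
    double[][] matrix = new double[termIndex.size()][docs.size()];
    for (int col = 0; col < docs.size(); col++) {
      for (String term : docs.get(col)) {
        matrix[termIndex.get(term)][col] = 1.0; // 1 if the doc contains the term, else 0
      }
    }
    return matrix;
  }
}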
Step Two – Decompose the matrix
Input : term-document matrix
Output : 3 matrices, U, S and V
Apply Singular Value Decomposition (SVD) to the matrix. This is the computationally expensive step of the whole operation.
SVD is a fairly technical concept and quite an involved process (if you're doing it by hand). If you do a bit of googling, you're going to find all kinds of mathematical terms related to this, like matrix decomposition, eigenvalues, eigenvectors, PCA (principal component analysis), random projection etc.
The 5 second explanation of this step is that the original term-document matrix gets broken down into 3 simpler matrices: a term-term matrix (also known as U, or the left matrix), a diagonal matrix comprising the singular values (also known as S), and a document-document matrix (also known as V, or the right matrix).
Something which usually also happens in the SVD step for LSA, and which is important, is rank reduction. In this context, rank reduction means that the original term-document matrix gets somehow "factorized" into its constituent factors, and the k most significant factors or features are retained, where k is some number greater than zero and less than the original size of the term-document matrix. For example, a rank 3 reduction means that the 3 most significant factors are retained. This is important for you to know because most LSA/LSI applications will ask you to specify the value of k, meaning the application wants to know how many features you want to retain.
So what's actually happening in this SVD rank reduction, is basically an approximation of the original term-document matrix, allowing you to compare features in a fast and efficient manner. Smaller k values generally run faster and use less memory, but are less accurate. Larger k values are more "true" to the original matrix, but require longer to compute. Note: this statement may not be true of the stochastic SVD implementations (involving random projection or some other method), where an increase in k doesn't lead to a linear increase in running time, but more like a log(n) increase in running time.
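The post doesn't prescribe any particular library, but as a rough sketch of what the decomposition plus rank-k truncation could look like, here's a version using Apache Commons Math (commons-math3); the library choice and class names are my assumption, not part of LSA itself. k must be at most the smaller dimension of the matrix.

import org.apache.commons.math3.linear.Array2DRowRealMatrix;
import org.apache.commons.math3.linear.RealMatrix;
import org.apache.commons.math3.linear.SingularValueDecomposition;

public class LsaDecomposition {
  // Decomposes the term-document matrix and keeps only the k most significant factors.
  public static RealMatrix[] decompose(double[][] termDoc, int k) {
    SingularValueDecomposition svd = new SingularValueDecomposition(new Array2DRowRealMatrix(termDoc));
    RealMatrix u = svd.getU(); // term-term factors (the left matrix)
    RealMatrix s = svd.getS(); // diagonal matrix of singular values, largest first
    RealMatrix v = svd.getV(); // document-document factors (the right matrix)
    // rank reduction: retain only the first k columns / singular values
    RealMatrix uk = u.getSubMatrix(0, u.getRowDimension() - 1, 0, k - 1);
    RealMatrix sk = s.getSubMatrix(0, k - 1, 0, k - 1);
    RealMatrix vk = v.getSubMatrix(0, v.getRowDimension() - 1, 0, k - 1);
    return new RealMatrix[]{uk, sk, vk};
  }
}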
Step Three – Build query vector
Input : query string
Output : query vector
From here, we're on the downhill stretch. The query string gets analyzed the same way the documents were (lowercasing, stopwords, etc.), turned into a term vector using the same term-to-row mapping as the term-document matrix, and then folded into the reduced k-dimensional space (the standard formulation is qHat = inverse(Sk) * transpose(Uk) * q), so it can be compared against the document vectors.
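Continuing the Commons Math sketch from the previous step, and assuming the uk and sk matrices from there plus a raw query term vector built with the same term indexes as the documents, folding might look like this:

import org.apache.commons.math3.linear.ArrayRealVector;
import org.apache.commons.math3.linear.LUDecomposition;
import org.apache.commons.math3.linear.RealMatrix;
import org.apache.commons.math3.linear.RealVector;

public class QueryFolding {
  // Folds a raw query term vector q into the reduced k-dimensional space:
  // qHat = inverse(Sk) * transpose(Uk) * q
  // Sk is diagonal, so its inverse is just the reciprocals of the singular values;
  // using a generic solver here keeps the sketch short.
  public static RealVector fold(RealMatrix uk, RealMatrix sk, double[] queryTermVector) {
    RealMatrix skInverse = new LUDecomposition(sk).getSolver().getInverse();
    return skInverse.multiply(uk.transpose()).operate(new ArrayRealVector(queryTermVector));
  }
}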
Step Four – Compute cosine distance
Input : query vector, document matrix
Output : document scores
To obtain how similar each document is to the query, aka the doc score, we have to go through each document vector in the matrix and calculate its cosine distance to the query vector.
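The scoring itself boils down to a plain cosine computation. A minimal version over two double[] vectors (the names are mine) looks like this:

public class Cosine {
  // cosine similarity between the folded query vector and one document vector in the reduced space
  public static double similarity(double[] query, double[] doc) {
    double dot = 0, qNorm = 0, dNorm = 0;
    for (int i = 0; i < query.length; i++) {
      dot += query[i] * doc[i];
      qNorm += query[i] * query[i];
      dNorm += doc[i] * doc[i];
    }
    if (qNorm == 0 || dNorm == 0) return 0;
    return dot / (Math.sqrt(qNorm) * Math.sqrt(dNorm));
  }
}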
Voila!!
Naive Solr Did You Mean re-searcher SearchComponent
Posted by Kelvin on 05 Sep 2013 | Tagged as: Lucene / Solr / Elasticsearch / Nutch
Solr makes Spellcheck easy. Super-easy in fact. All you need to do is to change some stuff in solrconfig.xml, and voila, spellcheck suggestions!
However, that's not how Google does spellchecking. What Google does is determine if the query contains a misspelling, and if so, transparently correct the misspelled term and perform the search, while still giving you the option of searching for the original term via a link.
Now, whilst it'd be uber-cool to have an exact equivalent in Solr, you'd need some statistical data to be able to perform this efficiently. A naive version is to use the spellcheck corrections to transparently perform a new query when the original query returns fewer than x hits, where x is some arbitrarily small number.
Here's a simple SearchComponent that does just that:
import org.apache.solr.common.util.NamedList;
import org.apache.solr.handler.component.QueryComponent;
import org.apache.solr.handler.component.ResponseBuilder;

import java.io.IOException;

public class AutoSpellcheckResearcher extends QueryComponent {
  // if less than *threshold* hits are returned, a re-search is triggered
  private int threshold = 0;

  @Override
  public void init(NamedList args) {
    super.init(args);
    this.threshold = (Integer) args.get("threshold");
  }

  @Override
  public void prepare(ResponseBuilder rb) throws IOException {
  }

  @Override
  public void process(ResponseBuilder rb) throws IOException {
    long hits = rb.getNumberDocumentsFound();
    if (hits <= threshold) {
      final NamedList responseValues = rb.rsp.getValues();
      NamedList spellcheckresults = (NamedList) responseValues.get("spellcheck");
      if (spellcheckresults != null) {
        NamedList suggestions = (NamedList) spellcheckresults.get("suggestions");
        if (suggestions != null) {
          final NamedList collation = (NamedList) suggestions.get("collation");
          if (collation != null) {
            String collationQuery = (String) collation.get("collationQuery");
            if (responseValues != null) {
              responseValues.add("researched.original", rb.getQueryString());
              responseValues.add("researched.replaced", collationQuery);
              responseValues.remove("response");
            }
            rb.setQueryString(collationQuery);
            super.prepare(rb);
            super.process(rb);
          }
        }
      }
    }
  }

  @Override
  public String getDescription() {
    return "AutoSpellcheckResearcher";
  }

  @Override
  public String getSource() {
    return "1.0";
  }
}
Reading ElasticSearch server book…
Posted by Kelvin on 23 May 2013 | Tagged as: Lucene / Solr / Elasticsearch / Nutch
Just got my hands on a review copy of PacktPub's ElasticSearch Server book, which I believe is the first ES book on the market.
Review to follow shortly..
New file formats in Lucene 4.1+ index
Posted by Kelvin on 30 Apr 2013 | Tagged as: Lucene / Solr / Elasticsearch / Nutch
Lucene 4.1 introduces new files in the index.
Here's a link to the documentation: https://builds.apache.org/job/Lucene-Artifacts-trunk/javadoc/core/org/apache/lucene/codecs/lucene41/Lucene41PostingsFormat.html
The different types of files are:
.tim: Term Dictionary
.tip: Term Index
.doc: Frequencies and Skip Data
.pos: Positions
.pay: Payloads and Offsets
Permission filtering in Solr using an ACL permissions string
Posted by Kelvin on 03 Apr 2013 | Tagged as: Lucene / Solr / Elasticsearch / Nutch
For an app I'm working on, permissions ACL is stored in a string, in the form:
category1=100|category2=300|category3=300
Both users and documents have an ACL string.
The number represents the access level for that category. Bigger numbers mean higher access.
In the previous Lucene-based iteration, to perform permission filtering, I just loaded the entire field into memory and did quick in-memory lookups. In this current iteration, I'm trying something different.
I'm creating one field per category, and populating the field values accordingly. Then when searching, I need to search across all the possible categories using range queries, including specifying empty fields where applicable. Works pretty well. The main drawback (and it's a severe one) is that I need to know all the categories a priori. This is not a problem for this app, but might be for other folks.
Here's an example of how it looks.
Document A: user=300|moderator=100
maps to
acl_user:300
acl_moderator:100
User A: moderator=300
Filter Query to determine if User A can access Document A:
-acl_user:[* TO *] acl_moderator:[0 TO 300]
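For illustration, here's a rough sketch of how the per-category fields and that filter query could be generated from an ACL string. The class, method, and variable names here are my own, not from the actual app; it just reproduces the acl_user / acl_moderator example above.

import java.util.HashMap;
import java.util.Map;
import org.apache.solr.common.SolrInputDocument;

public class AclIndexer {
  // Turns "user=300|moderator=100" into acl_user=300 and acl_moderator=100 fields on the document.
  public static void addAclFields(SolrInputDocument doc, String acl) {
    for (String pair : acl.split("\\|")) {
      String[] kv = pair.split("=");
      doc.addField("acl_" + kv[0], Integer.parseInt(kv[1]));
    }
  }

  // Builds the filter query for a user's ACL, given the full set of known categories.
  // A category absent from the user's ACL must also be absent from the document (-field:[* TO *]).
  public static String buildFilterQuery(String userAcl, String[] allCategories) {
    Map<String, Integer> userLevels = new HashMap<String, Integer>();
    for (String pair : userAcl.split("\\|")) {
      String[] kv = pair.split("=");
      userLevels.put(kv[0], Integer.parseInt(kv[1]));
    }
    StringBuilder fq = new StringBuilder();
    for (String category : allCategories) {
      Integer level = userLevels.get(category);
      if (level == null) {
        fq.append("-acl_").append(category).append(":[* TO *] ");
      } else {
        fq.append("acl_").append(category).append(":[0 TO ").append(level).append("] ");
      }
    }
    return fq.toString().trim();
  }
}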
Solr DateField java dateformat
Posted by Kelvin on 03 Apr 2013 | Tagged as: Lucene / Solr / Elasticsearch / Nutch
Grrrr… keep forgetting the Solr DateField dateformat, so here it is for posterity.
new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'");
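One gotcha worth noting (my addition, not from the original note): Solr expects these timestamps in UTC, and because the 'Z' in that pattern is a quoted literal rather than a time zone, the formatter won't convert for you. So it's safest to pin the time zone explicitly, along the lines of:

import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public class SolrDateFormat {
  public static String format(Date date) {
    SimpleDateFormat df = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'");
    df.setTimeZone(TimeZone.getTimeZone("UTC")); // Solr dates are UTC; 'Z' in the pattern is just a literal
    return df.format(date);
  }
}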
Java port of Quicksilver-style Live Search
Posted by Kelvin on 19 Nov 2012 | Tagged as: programming, Lucene / Solr / Elasticsearch / Nutch
Here's a straight Java port of the quicksilver algo, found here: http://orderedlist.com/blog/articles/live-search-with-quicksilver-style-for-jquery/
quicksilver.js contains the actual algorithm in javascript.
It uses the same input strings as the demo page at http://static.railstips.org/orderedlist/demos/quicksilverjs/jquery.html
import java.io.IOException;
import java.util.TreeSet;

public class Quicksilver {
  public static void main(String[] args) throws IOException {
    for (ScoreDoc doc : getScores("DGHTD")) System.out.println(doc);
    System.out.println("============================================");
    for (ScoreDoc doc : getScores("Web")) System.out.println(doc);
    System.out.println("============================================");
    for (ScoreDoc doc : getScores("jhn nnmkr")) System.out.println(doc);
    System.out.println("============================================");
    for (ScoreDoc doc : getScores("wp")) System.out.println(doc);
  }

  public static TreeSet<ScoreDoc> getScores(String term) {
    term = term.toLowerCase();
    TreeSet<ScoreDoc> scores = new TreeSet<ScoreDoc>();
    for (int i = 0; i < cache.length; i++) {
      float score = score(cache[i], term, 0);
      if (score > 0) {
        scores.add(new ScoreDoc(score, i));
      }
    }
    return scores;
  }

  public static float score(String str, String abbreviation, int offset) {
    // int offset ? offset : 0 // TODO: I think this is unused... remove
    if (abbreviation.length() == 0) return 0.9f;
    if (abbreviation.length() > str.length()) return 0.0f;
    for (int i = abbreviation.length(); i > 0; i--) {
      String sub_abbreviation = abbreviation.substring(0, i);
      int index = str.indexOf(sub_abbreviation);
      if (index < 0) continue;
      if (index + abbreviation.length() > str.length() + offset) continue;
      String next_string = str.substring(index + sub_abbreviation.length());
      String next_abbreviation = null;
      if (i >= abbreviation.length())
        next_abbreviation = "";
      else
        next_abbreviation = abbreviation.substring(i);
      float remaining_score = score(next_string, next_abbreviation, offset + index);
      if (remaining_score > 0) {
        float score = str.length() - next_string.length();
        if (index != 0) {
          int j = 0;
          char c = str.charAt(index - 1);
          if (c == 32 || c == 9) {
            for (j = (index - 2); j >= 0; j--) {
              c = str.charAt(j);
              score -= ((c == 32 || c == 9) ? 1 : 0.15);
            }
            // XXX maybe not port str heuristic
            //
            // } else if ([[NSCharacterSet uppercaseLetterCharacterSet] characterIsMember:[self characterAtIndex:matchedRange.location]]) {
            //   for (j = matchedRange.location-1; j >= (int) searchRange.location; j--) {
            //     if ([[NSCharacterSet uppercaseLetterCharacterSet] characterIsMember:[self characterAtIndex:j]])
            //       score--;
            //     else
            //       score -= 0.15;
            //   }
          } else {
            score -= index;
          }
        }
        score += remaining_score * next_string.length();
        score /= str.length();
        return score;
      }
    }
    return 0.0f;
  }

  public static class ScoreDoc implements Comparable<ScoreDoc> {
    public float score;
    public int doc;
    public String term;

    public ScoreDoc(float score, int doc) {
      this.score = score;
      this.doc = doc;
      this.term = cache[doc];
    }

    public int compareTo(ScoreDoc o) {
      if (o.score < score) return -1;
      if (o.score > score) return 1;
      return 0;
    }

    @Override
    public boolean equals(Object o) {
      if (this == o) return true;
      if (o == null || getClass() != o.getClass()) return false;
      ScoreDoc scoreDoc = (ScoreDoc) o;
      if (doc != scoreDoc.doc) return false;
      if (Float.compare(scoreDoc.score, score) != 0) return false;
      return true;
    }

    @Override
    public int hashCode() {
      int result = (score != +0.0f ? Float.floatToIntBits(score) : 0);
      result = 31 * result + doc;
      return result;
    }

    @Override
    public String toString() {
      final StringBuilder sb = new StringBuilder();
      sb.append("ScoreDoc");
      sb.append("{score=").append(score);
      sb.append(", doc=").append(doc);
      sb.append(", term='").append(term).append('\'');
      sb.append('}');
      return sb.toString();
    }
  }

  public static String[] cache = new String[]{
      "The Well-Designed Web", "Welcome John Nunemaker", "Sidebar Creative: The Next Steps", "The Web/Desktop Divide",
      "2007 in Review", "Don't Complicate the Solution", "Blog to Business", "Single Line CSS", "Comments Work Again",
      "The iPhone Effect", "Greek Blogger Camp", "FeedBurner FeedSmith", "Download Counter Update 1.3", "Branding Reworked",
      "Productivity and Fascination", "Passing the Torch", "Goodbye Austin", "Ordered Shirts", "Sidebar Creative",
      "Building the Modern Web", "Open for Business", "The Art and Science of CSS", "WP Tiger Administration v3.0",
      "Cleaning House", "Tiger Admin 3.0 Beta Testing", "Rails and MVC", "Updates and More", "FeedBurner Plugin v2.1 Released",
      "The Global Health Crisis", "WP FeedBurner v2.1 Beta", "Web Development and Information Technology", "On Becoming a Dad",
      "Tiger Admin and Shuttle", "Staying Small in a Big Place: Part 1", "WaSP eduTF Interview", "Planned Parenthood",
      "IE7 and Clearing Floats", "SXSWi 2006: Dan Gilbert - How To Do Exactly the Right Thing at All Possible Times",
      "SXSWi 2006: Traditional Design and New Technology", "SXSWi 2006: Almost There", "HOWTO: Animated Live Search",
      "Leaving Solo", "Tagged for Four Things", "Automotive Interface", "Another FeedBurner Plugin Update", "WP Tiger Admin 2.0",
      "WordPress FeedBurner Plugin for 2.0", "SXSWi 2006", "Statistical AJAX", "Semantics and Design", "Download Counter Update",
      "Best Buy, Worst Experience", "A Realign, or Whatever", "Stop with the Jargon", "10K+ for Tiger Plugin",
      "Flock and Integration", "Only the Beginning", "A Tip of the Hat", "3 Years", "Pepper: Download Counter",
      "Launch: Notre Dame College of Arts and Letters", "Innovation, Progress, and Imagination", "This Thing Here", "Ode",
      "Web Developer Opening", "WordPress Administration Design: Tiger", "SAJAX ColdFusion POST Request Method",
      "An Underscore Solution", "Google and the Underscore", "The Hand Off", "WordPress Upgrade and RSS",
      "WordPress FeedBurner Plugin", "Documentation Process", "WordPress Underscore Plugin", "CMS Release",
      "Two Suggestions for iTunes", "Call for Good Music", "A Change of Platform", "Point/Counterpoint: The Wrapper Div",
      "IE7 List, As Requested", "I'm a Switcher", "Breadcrumb Trails", "Output Code", "Bending the Matrix",
      "Children's Resource Group", "Do You Freelance?", "Project Management Software", "I Can't Stand It!",
      "Shiver Me Timbers!", "NDWG V1.0", "Dealing with IE5/Mac", "To All", "A Natural Progression", "Finishing the Basement",
      "Where Do You Live?", "The Recursion Project", "Clearing Floats: The FnE Method", "Ordered Zen", "Comment RSS",
      "Wordpress Code", "Writing Lean CSS", "v3.0 CMYK", "A Clean Slate", "Working for the Irish", "Excuse the Mess",
      "A Little Help", "Design Revisions", "Aloha", "FTOS Round 2", "I Love Storms", "One Gig?", "AD:TECH 2004 Chicago",
      "Thanks and Response", "OrderedList.com v2.0", "Skuzzy Standards", "Simple List", "Anger Management",
      "A Practical Start to Web Standards", "Irony and Progress", "The Familiar Chirping of Crickets",
      "Results of FTOS Round 1", "Figure This Out, Steve", "Increasing Developer Productivity", "One Down",
      "Content Management Your Way", "We Have Liftoff", "The Great Divide", "What's in a Name?",
      "Just How Important is Validation?"};

  static {
    for (int i = 0, n = cache.length; i < n; i++) {
      cache[i] = cache[i].toLowerCase();
    }
  }
}
Apache Solr vs ElasticSearch – the website
Posted by Kelvin on 14 Nov 2012 | Tagged as: Lucene / Solr / Elasticsearch / Nutch
Just spent the day hacking together a website that does a blow-by-blow examination of Solr vs ElasticSearch.
Hopefully it'll address any questions people might have about whether to use Solr or ES..
Let me know what you think!
The anatomy of a Lucene Tokenizer
Posted by Kelvin on 12 Nov 2012 | Tagged as: Lucene / Solr / Elasticsearch / Nutch
A term is the unit of search in Lucene. A Lucene document comprises a set of terms. Tokenization means splitting up a string into tokens, or terms.
A Lucene Tokenizer is what Lucene (and, correspondingly, Solr) uses to tokenize text.
To implement a custom Tokenizer, you extend org.apache.lucene.analysis.Tokenizer.
The only method you need to implement is public boolean incrementToken(). incrementToken returns false for EOF, true otherwise.
Tokenizers generally take a Reader input in the constructor, which is the source to be tokenized.
With each invocation of incrementToken(), the Tokenizer is expected to return the next token by setting the values of its TermAttributes. Attributes are registered with the superclass via addAttribute(), usually as fields in the Tokenizer, e.g.
public class MyCustomTokenizer extends Tokenizer {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
Here, a CharTermAttribute is added to the superclass. A CharTermAttribute stores the term text.
Here's one way to set the value of the term text in incrementToken().
private boolean done = false; // field on the tokenizer: this simple example emits the whole input as one token

public boolean incrementToken() throws IOException {
  if (done) return false;
  done = true;
  StringBuilder sb = new StringBuilder();
  char[] buffer = new char[512];
  while (true) {
    final int length = input.read(buffer, 0, buffer.length); // input is the Reader set in the ctor
    if (length == -1) break;
    sb.append(buffer, 0, length); // append only the chars actually read
  }
  termAtt.append(sb.toString());
  return true;
}
And that's pretty much all you need to start writing custom Lucene tokenizers!
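To sanity-check a tokenizer like this outside of Solr, you can drive it directly with the usual TokenStream consumption loop. This is just a sketch: MyCustomTokenizer and its Reader constructor are assumed from the snippets above, and the exact reset()/end() contract varies a little between Lucene versions.

import java.io.StringReader;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class TokenizerDemo {
  public static void main(String[] args) throws Exception {
    Tokenizer tokenizer = new MyCustomTokenizer(new StringReader("Hello tokenizer world"));
    CharTermAttribute termAtt = tokenizer.getAttribute(CharTermAttribute.class);
    tokenizer.reset();                    // prepare the stream before consuming it
    while (tokenizer.incrementToken()) {  // false means EOF
      System.out.println(termAtt.toString());
    }
    tokenizer.end();
    tokenizer.close();
  }
}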