Dynamic facet population with Solr DataImportHandler
Posted by Kelvin on 02 Aug 2010 at 02:03 pm | Tagged as: Lucene / Solr / Elasticsearch / Nutch, programming
Here's what I'm trying to do:
Given this mysql table:
CREATE TABLE `tag` ( `id` INTEGER AUTO_INCREMENT NOT NULL PRIMARY KEY, `name` VARCHAR(100) NOT NULL UNIQUE, `category` VARCHAR(100) ); INSERT INTO tag (name,category) VALUES ('good','foo'); INSERT INTO tag (name,category) VALUES ('awe-inspiring','foo'); INSERT INTO tag (name,category) VALUES ('mediocre','bar'); INSERT INTO tag (name,category) VALUES ('terrible','car');
and this solr schema
<field name="tag-foo" type="string" indexed="true" stored="true" multiValued="true"/> <field name="tag-bar" type="string" indexed="true" stored="true" multiValued="true"/> <field name="tag-car" type="string" indexed="true" stored="true" multiValued="true"/>
to populate these tag fields via DataImportHandler.
The dumb (but straightforward) way to do it is to use sub-entities, but this is terribly expensive since you use one extra SQL query per category.
Solution
My general approach was to concatenate the rows into a single row, then use RegexTransformer and a custom dataimport Transformer to split out the values.
Here's how I did it:
My dataimporthandler xml:
<entity name="tag-facets" transformer="RegexTransformer,org.supermind.solr.TagFacetsTransformer" query="select group_concat(concat(t.category,'=',t.name) separator '#') as tagfacets from tag t,booktag bt where bt.id='${book.id}' and t.category is not null"> <field column="tagfacets" splitBy="#"/> </entity>
You'll see that a temporary field tagfacets is used. This will be deleted later on in TagFacetsTransformer.
package org.supermind.solr; import org.apache.solr.handler.dataimport.Context; import org.apache.solr.handler.dataimport.Transformer; import java.util.List; import java.util.Map; public class TagFacetsTransformer extends Transformer { public Object transformRow(Map<String, Object> row, Context context) { Object tf = row.get("tagfacets"); if (tf != null) { if (tf instanceof List) { List list = (List) tf; for (Object o : list) { String[] arr = ((String) o).split("="); if (arr.length == 2) row.put("tag-" + arr[0], arr[1]); } } else { String[] arr = ((String) tf).split("="); if (arr.length == 2) row.put("tag-" + arr[0], arr[1]); } row.remove("tagfacets"); } return row; } }
Here's the output via DIH's verbose output (with my own data):
<str name="tagfacets">lang=ruby#framework=ruby-on-rails</str> <str>---------------------------------------------</str> <lst name="transformer:RegexTransformer"> <str>---------------------------------------------</str> <arr name="tagfacets"> <str>lang=ruby</str> <str>framework=ruby-on-rails</str> </arr> <str>---------------------------------------------</str> <lst name="transformer:org.supermind.solr.TagFacetsTransformer"> <str>---------------------------------------------</str> <str name="tag-framework">ruby-on-rails</str> <str name="tag-lang">ruby</str> <str>---------------------------------------------</str> </lst> </lst> </lst>
You can see the step-by-step transformation of the input value.
Pretty nifty, eh?