Simplistic noun-phrase chunking with POS tags in Java
Posted by Kelvin on 16 Jun 2012 at 05:18 pm | Tagged as: programming, Lucene / Solr / Elasticsearch / Nutch
I needed to extract noun phrases from text. The way this is generally done is using Part-of-Speech (POS) tags. OpenNLP has both a POS tagger and a noun-phrase chunker. However, it's really, really slow!
I decided to look into alternatives, and chanced upon QTag.
QTag is a "freely available, language independent POS-Tagger. It is implemented in Java, and has been successfully tested on Mac OS X, Linux, and Windows."
It's waaay faster than OpenNLP for POS-tagging, though I haven't done any benchmarks on its accuracy.
Here's my really simplistic but adequate implementation of noun-phrase chunking using QTag.
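import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
// Qtag itself comes from the QTag jar.

// The field and method below belong inside a class.
// The tagger is static so the static method can reuse it across calls.
private static Qtag qt;

public static List<String> chunkQtag(String str) throws IOException {
  List<String> result = new ArrayList<String>();
  // Load the tagger lazily, once.
  if (qt == null) {
    qt = new Qtag("lib/english");
    qt.setOutputFormat(2);
  }
  String[] split = str.split("\n");
  for (String line : split) {
    // The tagged output is parsed as one token per line: token<TAB>tag.
    String s = qt.tagLine(line, true);
    String lastTag = null;
    String lastToken = null;
    StringBuilder accum = new StringBuilder();
    for (String token : s.split("\n")) {
      String[] s2 = token.split("\t");
      if (s2.length < 2) continue;
      String tag = s2[1];
      // Keep adjectives, nouns and unknown-tagged tokens, plus "of" after a noun
      // and "the" after "of" (so "king of the hill" stays in one piece).
      if (tag.equals("JJ")
          || tag.startsWith("NN")
          || tag.startsWith("??")
          || (lastTag != null && lastTag.startsWith("NN") && s2[0].equalsIgnoreCase("of"))
          || (lastToken != null && lastToken.equalsIgnoreCase("of") && s2[0].equalsIgnoreCase("the"))) {
        accum.append(s2[0]).append("-");
      } else {
        // Any other tag ends the current phrase.
        if (accum.length() > 0) {
          accum.deleteCharAt(accum.length() - 1); // drop the trailing "-"
          result.add(accum.toString());
          accum = new StringBuilder();
        }
      }
      lastTag = tag;
      lastToken = s2[0];
    }
    // Flush a phrase that runs to the end of the line.
    if (accum.length() > 0) {
      accum.deleteCharAt(accum.length() - 1);
      result.add(accum.toString());
    }
  }
  return result;
}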
The method returns a list of noun phrases, with the tokens of each phrase joined by hyphens.
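To see what the chunking rule does without needing the QTag jar, here's a minimal sketch of the same logic running on tokens that have already been tagged. The class name `ChunkSketch`, the `String[][]` input shape and the sample tags `DT`, `VBD` and `IN` are my own assumptions for illustration; only `JJ`, `NN*` and `??` come from the code above, and QTag's real tagset may label the other words differently.

```java
import java.util.ArrayList;
import java.util.List;

public class ChunkSketch {

    /** Each element of tagged is a {token, tag} pair, e.g. {"fox", "NN"}. */
    public static List<String> chunk(String[][] tagged) {
        List<String> result = new ArrayList<String>();
        StringBuilder accum = new StringBuilder();
        String lastTag = null;
        String lastToken = null;
        for (String[] pair : tagged) {
            String token = pair[0];
            String tag = pair[1];
            // Same rule as chunkQtag: adjectives, nouns, unknown tags,
            // "of" after a noun, and "the" after "of".
            boolean inPhrase = tag.equals("JJ")
                    || tag.startsWith("NN")
                    || tag.startsWith("??")
                    || (lastTag != null && lastTag.startsWith("NN")
                        && token.equalsIgnoreCase("of"))
                    || (lastToken != null && lastToken.equalsIgnoreCase("of")
                        && token.equalsIgnoreCase("the"));
            if (inPhrase) {
                if (accum.length() > 0) {
                    accum.append('-');
                }
                accum.append(token);
            } else if (accum.length() > 0) {
                // Any other tag closes the current phrase.
                result.add(accum.toString());
                accum.setLength(0);
            }
            lastTag = tag;
            lastToken = token;
        }
        // Flush a phrase that runs to the end of the input.
        if (accum.length() > 0) {
            result.add(accum.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        // Hand-tagged sample; the tags are illustrative, not QTag output.
        String[][] tagged = {
            {"The", "DT"}, {"quick", "JJ"}, {"fox", "NN"}, {"ate", "VBD"},
            {"the", "DT"}, {"king", "NN"}, {"of", "IN"}, {"the", "DT"}, {"hill", "NN"}
        };
        System.out.println(chunk(tagged)); // prints [quick-fox, king-of-the-hill]
    }
}
```

Note that the "the after of" check only looks at the previous token, so a stray "of the" that doesn't follow a noun will still start a phrase with "the". That's one of the places where "simplistic" applies.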