Dom4j + XPath + TagSoup – Namespaces = sweet!
Posted by Kelvin on 20 Jan 2010 at 04:02 pm | Tagged as: programming
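TagSoup does this annoying thing of putting every element of the HTML it cleans into the XHTML namespace.
This becomes a major hindrance when formulating XPath queries against TagSoup-cleaned HTML, because an unprefixed name in XPath only matches elements that are in no namespace.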
Instead of using
//body/a/@href
we have to do
//html:body/html:a/@href
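To see the difference without dom4j or TagSoup on the classpath, here is a minimal sketch using only the JDK's DOM and XPath APIs. It is a stand-in, not what the post's code actually runs: the class name `PrefixDemo`, the sample markup and the `html` prefix binding are all assumptions made for illustration, and the markup only approximates what TagSoup produces.

```java
import java.io.StringReader;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class PrefixDemo {
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    // Hypothetical sample: every element sits in the XHTML namespace.
    static final String SAMPLE = "<html xmlns=\"" + XHTML_NS + "\">"
            + "<body><a href=\"http://example.com/\">link</a></body></html>";

    /** Evaluates the expression against SAMPLE and returns its string value. */
    public static String evaluate(String expr) {
        try {
            DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
            f.setNamespaceAware(true); // keep namespace information in the DOM
            Document doc = f.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(SAMPLE)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            // Bind the "html" prefix to the XHTML namespace for prefixed queries.
            xpath.setNamespaceContext(new NamespaceContext() {
                public String getNamespaceURI(String prefix) {
                    return "html".equals(prefix) ? XHTML_NS : XMLConstants.NULL_NS_URI;
                }
                public String getPrefix(String uri) { return null; }
                public Iterator<String> getPrefixes(String uri) { return null; }
            });
            return xpath.evaluate(expr, doc);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Unprefixed names only match no-namespace elements, so this finds nothing.
        System.out.println("[" + evaluate("//body/a/@href") + "]");
        // The prefixed form matches the namespaced elements.
        System.out.println("[" + evaluate("//html:body/html:a/@href") + "]");
    }
}
```

The first query comes back empty and the second returns the link, which is exactly the annoyance: every step of every query needs the prefix, plus the namespace binding to go with it.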
I spent a couple hours trying to figure out how to disable namespace prefixes in TagSoup.
This does not work:
parser.setFeature(org.ccil.cowan.tagsoup.Parser.namespacesFeature, false);
This doesn't work either:
parser.setFeature(org.ccil.cowan.tagsoup.Parser.namespacePrefixesFeature, false);
I finally stumbled on a crude brute-force solution at http://www.mail-archive.com/dom4j-user%40lists.sourceforge.net/msg02511.html
import java.util.List;
import org.dom4j.Attribute;
import org.dom4j.Document;
import org.dom4j.Element;
import org.dom4j.Namespace;
import org.dom4j.Node;
import org.dom4j.QName;

// The class name is arbitrary; the original snippet is just a bag of static methods.
public class NamespaceUtil {

    // The original uses this flag without declaring it, so it is declared here.
    private static boolean removeNamespaces = true;

    /** Removes namespaces below the root if removeNamespaces is true (the root keeps its own). */
    public static void fixNamespaces(Document doc) {
        Element root = doc.getRootElement();
        if (removeNamespaces && root.getNamespace() != Namespace.NO_NAMESPACE)
            removeNamespaces(root.content());
    }

    /** Puts the original namespace back below the root if removeNamespaces is true. */
    public static void unfixNamespaces(Document doc, Namespace original) {
        Element root = doc.getRootElement();
        if (removeNamespaces && original != null)
            setNamespaces(root.content(), original);
    }

    /** Sets the namespace of the element to the given namespace. */
    public static void setNamespace(Element elem, Namespace ns) {
        elem.setQName(QName.get(elem.getName(), ns, elem.getQualifiedName()));
    }

    /** Recursively moves the element and all its children to Namespace.NO_NAMESPACE. */
    public static void removeNamespaces(Element elem) {
        setNamespaces(elem, Namespace.NO_NAMESPACE);
    }

    /** Recursively moves the list and all its children to Namespace.NO_NAMESPACE. */
    public static void removeNamespaces(List<?> l) {
        setNamespaces(l, Namespace.NO_NAMESPACE);
    }

    /** Recursively sets the namespace of the element and all its children. */
    public static void setNamespaces(Element elem, Namespace ns) {
        setNamespace(elem, ns);
        setNamespaces(elem.content(), ns);
    }

    /** Recursively sets the namespace of every element and attribute in the list. */
    public static void setNamespaces(List<?> l, Namespace ns) {
        for (int i = 0; i < l.size(); i++) {
            Node n = (Node) l.get(i);
            if (n.getNodeType() == Node.ATTRIBUTE_NODE)
                ((Attribute) n).setNamespace(ns);
            if (n.getNodeType() == Node.ELEMENT_NODE)
                setNamespaces((Element) n, ns);
        }
    }
}
Grrrrrrr... but at least we can say goodbye to prefixes in XPath queries.
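For anyone who wants to try the brute-force idea without dom4j, here is a rough JDK-only analogue. It swaps dom4j's setQName for the DOM Level 3 `Document.renameNode` call, so it is a different API doing the same trick, not the mailing-list code itself. The class name `StripNamespacesDemo`, the `href` helper and the sample markup are made up for illustration, and unlike the snippet above this version strips the root element too.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class StripNamespacesDemo {
    // Hypothetical stand-in for TagSoup output: all elements are in the XHTML namespace.
    static final String SAMPLE = "<html xmlns=\"http://www.w3.org/1999/xhtml\">"
            + "<body><a href=\"http://example.com/\">link</a></body></html>";

    /** Recursively renames every element at or below node into no namespace. */
    public static void stripNamespaces(Document doc, Node node) {
        if (node.getNodeType() == Node.ELEMENT_NODE && node.getNamespaceURI() != null) {
            // renameNode may hand back a replacement node, so continue with its result.
            node = doc.renameNode(node, null, node.getLocalName());
        }
        Node child = node.getFirstChild();
        while (child != null) {
            Node next = child.getNextSibling(); // fetch before the child can be replaced
            stripNamespaces(doc, child);
            child = next;
        }
    }

    /** Runs the unprefixed query on a fresh parse, optionally stripping namespaces first. */
    public static String href(boolean strip) {
        try {
            DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
            f.setNamespaceAware(true);
            Document doc = f.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(SAMPLE)));
            if (strip) {
                stripNamespaces(doc, doc.getDocumentElement());
            }
            return XPathFactory.newInstance().newXPath().evaluate("//body/a/@href", doc);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("before: [" + href(false) + "]");
        System.out.println("after:  [" + href(true) + "]");
    }
}
```

The walk visits every element once, which is why it counts as brute force; on big documents it may be cheaper to live with the prefixes.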