Supermind Search Consulting Blog 
Solr - Elasticsearch - Big Data

Posts about programming

Friendlier Thunderbird date columns

Posted by Kelvin on 24 Jan 2010 | Tagged as: programming

In Thunderbird, there's a way to customize the display of the date header in your inbox.

Short answer: Go to http://kb.mozillazine.org/Date_display_format and follow the instructions.

Long answer:

1. Edit > Preferences > Advanced > Config Editor
2. Type 'dateformat' in the search box and hit enter.
3. If the following 3 entries do not appear, you'll have to create them.

mail.ui.display.dateformat.today
mail.ui.display.dateformat.thisweek
mail.ui.display.dateformat.default

4. Right click in the empty space of the results pane. A menu will appear.
New > Integer
Preference name: mail.ui.display.dateformat.today
Preference value: 0
New > Integer
Preference name: mail.ui.display.dateformat.thisweek
Preference value: 4
New > Integer
Preference name: mail.ui.display.dateformat.default
Preference value: 2

5. If you already have these config entries, just change the values to match what I have above.

What this does is:

1. For messages received today, displays only the time (e.g. 10:15am).
2. For messages received this week, displays the day and time (e.g. Friday 10:15am).
3. For all other messages, uses the short date form (mm/dd/yyyy).

dom4j.org – WTF?

Posted by Kelvin on 22 Jan 2010 | Tagged as: programming

dom4j is one of the better XML parsing Java libraries out there.

It's released under the uber-liberal BSD license, and is the brainchild of James Strachan, also of Jelly and Groovy fame.

Yesterday I was working on some dom4j stuff, and noticed that www.dom4j.org (I'm not going to link to it) is no longer a mirror of the original http://dom4j.sourceforge.net.

Instead, it's been taken over by some SEO assholes in Belgium (www.yxymedia.com) who have made a visual clone of the dom4j look-and-feel, but have changed it to be about "making your own website". The headers now read "DOM4J – Making Your Own Site".

WTF??!!

There are no two ways about it. This is unethical, misleading, and embarrassing.

Guys @ yxymedia, if you read this, please stop.

CSS3 Selectors in Java

Posted by Kelvin on 21 Jan 2010 | Tagged as: programming

http://github.com/chrsan/css-selectors/tree has a cool implementation of full CSS3 Selector support.

Yes, this is the same CSS selector support as you get in jQuery. Eat your heart out.

It comes with an org.w3c.dom implementation out of the box.

I augmented it with a dom4j implementation (so I could mix in TagSoup for real-world HTML).

It's slow as a dog compared with native XPath or regex, but it's still cool nonetheless.

Update: OK, I was wrong about performance. My initial dom4j implementation was slow BECAUSE of xpath actually. When I changed it to use dom4j node traversal methods, performance increased by over 50x. I'm happy with performance now.

Dom4j + XPath + TagSoup – Namespaces = sweet!

Posted by Kelvin on 20 Jan 2010 | Tagged as: programming

TagSoup does this annoying thing of adding namespaces to the html it cleans.

This annoyance becomes a major hindrance when formulating XPath queries for tagsoup-cleaned html.

Instead of using

//body/a/@href

we have to do

//html:body/html:a/@href
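To see the difference in action, here is a minimal sketch using the JDK's built-in javax.xml.xpath rather than dom4j (a swapped-in API so the example runs without extra jars). The class name, the sample markup and the helper method are made up for illustration; the only thing taken from TagSoup is that it puts elements in the XHTML namespace http://www.w3.org/1999/xhtml, which the sketch binds to the html prefix.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class, not part of TagSoup or dom4j.
class NamespacedXPathDemo {
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    /** Parses xml namespace-aware and returns the string value of expr. */
    static String evaluate(String xml, String expr) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        // Bind the "html" prefix to the XHTML namespace for prefixed queries.
        xpath.setNamespaceContext(new NamespaceContext() {
            public String getNamespaceURI(String prefix) {
                return "html".equals(prefix) ? XHTML_NS : XMLConstants.NULL_NS_URI;
            }
            public String getPrefix(String uri) {
                return XHTML_NS.equals(uri) ? "html" : null;
            }
            public Iterator<String> getPrefixes(String uri) {
                return Collections.<String>emptyIterator();
            }
        });
        return xpath.evaluate(expr, doc);
    }

    public static void main(String[] args) throws Exception {
        // Same shape as TagSoup output: everything lives in the XHTML namespace.
        String xml = "<html xmlns='http://www.w3.org/1999/xhtml'><body>"
                + "<a href='http://example.com/'>x</a></body></html>";
        // Unprefixed names only match elements in no namespace, so this finds nothing.
        System.out.println("[" + evaluate(xml, "//body/a/@href") + "]");
        // The prefixed query matches the namespaced elements.
        System.out.println("[" + evaluate(xml, "//html:body/html:a/@href") + "]");
    }
}
```

The unprefixed query comes back empty because in XPath 1.0 a bare element name means "no namespace", not "default namespace".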

I spent a couple hours trying to figure out how to disable namespace prefixes in TagSoup.

This does not work:

parser.setFeature(org.ccil.cowan.tagsoup.Parser.namespacesFeature, false);

This doesn't work either:

parser.setFeature(org.ccil.cowan.tagsoup.Parser.namespacePrefixesFeature, false);

Finally stumbled on a crude brute-force solution at http://www.mail-archive.com/dom4j-user%40lists.sourceforge.net/msg02511.html

import java.util.List;

import org.dom4j.Attribute;
import org.dom4j.Document;
import org.dom4j.Element;
import org.dom4j.Namespace;
import org.dom4j.Node;
import org.dom4j.QName;

// The wrapper class name is arbitrary; the snippet as posted only has the methods.
public class NamespaceUtil {

    // The snippet references this flag without declaring it; assumed to be a static boolean.
    private static boolean removeNamespaces = true;

    /**
     * Removes namespaces from everything below the root if removeNamespaces is true.
     * Note that the root element itself keeps its namespace.
     */
    public static void fixNamespaces(Document doc) {
        Element root = doc.getRootElement();
        if (removeNamespaces && root.getNamespace() != Namespace.NO_NAMESPACE) {
            removeNamespaces(root.content());
        }
    }

    /**
     * Puts the original namespace back on everything below the root
     * if removeNamespaces is true.
     */
    public static void unfixNamespaces(Document doc, Namespace original) {
        Element root = doc.getRootElement();
        if (removeNamespaces && original != null) {
            setNamespaces(root.content(), original);
        }
    }

    /**
     * Sets the namespace of the element to the given namespace.
     */
    public static void setNamespace(Element elem, Namespace ns) {
        elem.setQName(QName.get(elem.getName(), ns, elem.getQualifiedName()));
    }

    /**
     * Recursively removes the namespace of the element and all its
     * children: sets to Namespace.NO_NAMESPACE.
     */
    public static void removeNamespaces(Element elem) {
        setNamespaces(elem, Namespace.NO_NAMESPACE);
    }

    /**
     * Recursively removes the namespace of every node in the list and all
     * their children: sets to Namespace.NO_NAMESPACE.
     */
    public static void removeNamespaces(List l) {
        setNamespaces(l, Namespace.NO_NAMESPACE);
    }

    /**
     * Recursively sets the namespace of the element and all its children.
     */
    public static void setNamespaces(Element elem, Namespace ns) {
        setNamespace(elem, ns);
        setNamespaces(elem.content(), ns);
    }

    /**
     * Recursively sets the namespace of every attribute and element in the
     * list, descending into child elements.
     */
    public static void setNamespaces(List l, Namespace ns) {
        for (int i = 0; i < l.size(); i++) {
            Node n = (Node) l.get(i);
            if (n.getNodeType() == Node.ATTRIBUTE_NODE) {
                ((Attribute) n).setNamespace(ns);
            }
            if (n.getNodeType() == Node.ELEMENT_NODE) {
                setNamespaces((Element) n, ns);
            }
        }
    }
}

Grrrrrrr….. but at least we can say goodbye to prefixes in xpath queries.

Using expressions to assign PHP static variables

Posted by Kelvin on 14 Jan 2010 | Tagged as: programming, PHP

OK. The PHP manual explicitly states you CANNOT use an expression when initializing a static variable in its declaration.

You can, however, do this:

class MyClass {
  public static $a = 1;
  public static $b;

  public static function init() {
    self::$b = self::$a + 1;
  }
}
MyClass::init();

Nifty eh?

Handling single query multiple ResultSets in MySQL and JDBC

Posted by Kelvin on 14 Jan 2010 | Tagged as: programming

I've used JDBC with MySQL forever, but funnily enough, never tried issuing multiple statements in a single query, which results in multiple resultsets.

If you ever get the SQLException "ResultSet is from UPDATE. No Data.", then read on, my friend.

Here's the lowdown:

1. Add ?allowMultiQueries=true to your JDBC URL, like so

jdbc:mysql://localhost/mydatabase?allowMultiQueries=true

Note: if you don't perform this step, the MySQL JDBC driver doesn't tell you you need to. It just complains with the usual syntax blah:

You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version...

2. Make your usual JDBC connection plumbing, then create a Statement object and call execute().

Statement stmt = conn.createStatement();
stmt.execute(sql);

3. Now the fun part. Since you're issuing multiple statements, some may return resultsets, and others may not.

The 3 methods that are going to help us navigate the multiple resultsets are getUpdateCount(), getMoreResults() and getResultSet(). Here's one loop that binds them all.

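    while (true) {
      if (stmt.getUpdateCount() > -1) {
        // current result is an update count, so skip to the next result
        stmt.getMoreResults();
        continue;
      }
      ResultSet rs = stmt.getResultSet();
      if (rs == null) break; // no update count and no resultset: we're done
      while (rs.next()) {
        // do something
      }
      // move past this resultset, otherwise the loop never terminates
      stmt.getMoreResults();
    }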
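If you want to reuse this, the same walk can be packaged as a helper that reports what each statement produced. This is only a sketch: the class name, method name and the "update:N" / "rows:N" output format are all made up for illustration. It relies only on the three java.sql.Statement methods named above, and it calls getMoreResults() after every result, whether that result was an update count or a resultset.

```java
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the MySQL driver.
class MultiResultWalker {

    /**
     * Walks every result left on stmt by a previous execute() call and
     * returns one entry per result: "update:N" for an update count,
     * "rows:N" for a resultset with N rows.
     */
    static List<String> summarize(Statement stmt) throws SQLException {
        List<String> out = new ArrayList<>();
        while (true) {
            int updateCount = stmt.getUpdateCount();
            if (updateCount > -1) {
                out.add("update:" + updateCount);
            } else {
                ResultSet rs = stmt.getResultSet();
                if (rs == null) {
                    break; // neither an update count nor a resultset: done
                }
                int rows = 0;
                while (rs.next()) {
                    rows++; // real code would read columns here
                }
                out.add("rows:" + rows);
            }
            stmt.getMoreResults(); // advance to the next result
        }
        return out;
    }
}
```

Call it right after stmt.execute(sql), for example with sql set to an UPDATE followed by a SELECT, separated by a semicolon.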

Solr 1.4.. but no fastvectorhighlighter

Posted by Kelvin on 10 Dec 2009 | Tagged as: programming

Solr 1.4 has been released. OK, it's old news. Exactly one month old actually.

However… the release doesn't include Lucene's FastVectorHighlighter.

I ended up writing my own simple plumbing code to fill in the gap for now.

My own informal testing showed a 40-60% decrease in highlighting times for largish (1MB+ in size) documents. Definitely impressive.

java.net.URL synchronization bottleneck

Posted by Kelvin on 08 Dec 2009 | Tagged as: programming, crawling

This is interesting because I haven't found anything on Google about it.

There's a static Hashtable in java.net.URL (urlStreamHandlers) which gets accessed on every constructor call. Hashtable is synchronized, so when you're running a crawler with, say, 50 threads, that turns out to be a major bottleneck.

Of the 70 threads I had running, 48 were blocked on the java.net.URL constructor. I was using the URL class for resolving relative URLs to absolute ones.

Since I had previously written a URL parser to parse out the parts of a URL, I went ahead and implemented my own URL resolution function.

Went from

Status: 12.407448 pages/s, 207.06316 kb/s, 2136.143 bytes/page

to

Status: 43.9947 pages/s, 557.29156 kb/s, 1621.4071 bytes/page

after increasing the number of threads to 100 (which would not have made much difference with the java.net.URL implementation).

Cool stuff.
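For anyone without a home-grown URL parser lying around, one stdlib option is java.net.URI, which does purely syntactic parsing and resolution and never goes through java.net.URL. This is a sketch of that alternative, not the resolver described above; the class and method names are made up. Be aware that URI is strict (URI.create throws IllegalArgumentException on things like unescaped spaces), so real-world crawl data will likely need cleaning first.

```java
import java.net.URI;

// Hypothetical helper showing URI-based resolution; not the custom parser from this post.
class UriResolver {

    /** Resolves a possibly relative link against the URL of the page it was found on. */
    static String resolve(String baseUrl, String link) {
        // URI.resolve() works on the parsed components only; no stream handlers involved.
        return URI.create(baseUrl).resolve(link).normalize().toString();
    }

    public static void main(String[] args) {
        String base = "http://example.com/a/b/c.html";
        System.out.println(resolve(base, "../d.html"));          // http://example.com/a/d.html
        System.out.println(resolve(base, "/x?y=1"));             // http://example.com/x?y=1
        System.out.println(resolve(base, "http://other.org/z")); // http://other.org/z
    }
}
```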

TokyoCabinet Linux Install Script

Posted by Kelvin on 06 Dec 2009 | Tagged as: Ubuntu, programming

Updated on Mar 22 2011 for latest versions

export JAVA_HOME=/usr/lib/jvm/current #changeme!
export MYJAVAHOME=$JAVA_HOME
 
wget http://1978th.net/tokyocabinet/tokyocabinet-1.4.47.tar.gz
tar -zxvf tokyocabinet-1.4.47.tar.gz
cd tokyocabinet-1.4.47
./configure --enable-off64 --prefix=/usr
make && sudo make install
cd ..
 
wget http://1978th.net/tokyocabinet/javapkg/tokyocabinet-java-1.24.tar.gz
tar -zxvf tokyocabinet-java-1.24.tar.gz
cd tokyocabinet-java-1.24
./configure --prefix=/usr
make && sudo make install
cd ..

You may need bzip2-devel + zlib (RH/Fedora) or libbz2-dev + zlib1g-dev (Debian/Ubuntu) installed before running configure.

Don't worry about the second bit if you don't need the Java bindings.

TokyoCabinet Installation snafu on Fedora

Posted by Kelvin on 06 Dec 2009 | Tagged as: programming

Just installed TokyoCabinet on Fedora. Installation went like a breeze. Except... when running the Java app that uses TC, it complained about an UnsatisfiedLinkError:

libtokyocabinet.so.9: cannot open shared object file: No such file or directory – /usr/lib

Thanks to http://jibbajabba.info/ who in turn credits http://www.machinelake.com/2009/03/22/nerding-out-with-ruby-tokyo-cabinet-hpricot-twitter-sinatra-haml-passenger/

The answer is simple:

ldconfig /usr/lib

or

ldconfig /usr/local/lib

depending on where you installed TC.
