Supermind Search Consulting Blog 
Solr - Elasticsearch - Big Data

TokyoCabinet Linux Install Script

Posted by Kelvin on 06 Dec 2009 | Tagged as: programming, Ubuntu

Updated on Mar 22 2011 for latest versions

    export JAVA_HOME=/usr/lib/jvm/current #changeme!
    export MYJAVAHOME=$JAVA_HOME

    wget http://1978th.net/tokyocabinet/tokyocabinet-1.4.47.tar.gz
    tar -zxvf tokyocabinet-1.4.47.tar.gz
    cd tokyocabinet-1.4.47
    ./configure --enable-off64 --prefix=/usr
    make && sudo make install
    cd ..

    wget http://1978th.net/tokyocabinet/javapkg/tokyocabinet-java-1.24.tar.gz
    tar -zxvf tokyocabinet-java-1.24.tar.gz
    cd tokyocabinet-java-1.24
    ./configure --prefix=/usr
    make && sudo make install
    cd ..

You may need bzip2-devel + zlib (RH/Fedora) […]

TokyoCabinet Installation snafu on Fedora

Posted by Kelvin on 06 Dec 2009 | Tagged as: programming

Just installed TokyoCabinet on Fedora. Installation went like a breeze, except that when running the Java app that uses TC, it complained about an UnsatisfiedLinkError: libtokyocabinet.so.9: cannot open shared object file: No such file or directory – /usr/lib. Thanks to http://jibbajabba.info/ who in turn credits http://www.machinelake.com/2009/03/22/nerding-out-with-ruby-tokyo-cabinet-hpricot-twitter-sinatra-haml-passenger/ The answer is simple: ldconfig /usr/lib or ldconfig /usr/local/lib depending on […]
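Before rerunning ldconfig, it can help to check whether the dynamic linker cache already lists the library. The following is only a rough sketch: the helper name lib_in_cache is made up for illustration, and it assumes a glibc-style ldconfig that supports the -p flag.

```shell
#!/bin/sh
# lib_in_cache is a hypothetical helper, not part of Tokyo Cabinet.
# It succeeds if the given name appears in the dynamic linker cache.
lib_in_cache() {
    # "ldconfig -p" prints the current cache (glibc); errors are silenced so
    # a missing ldconfig is simply treated as "not found".
    ldconfig -p 2>/dev/null | grep -qF -- "$1"
}

if lib_in_cache "libtokyocabinet.so"; then
    echo "libtokyocabinet is visible to the dynamic linker"
else
    # The prefix depends on where the library was installed.
    echo "libtokyocabinet is not in the linker cache; try: sudo ldconfig /usr/lib"
fi
```

If the library is reported as missing, running ldconfig against the install prefix (as root) and checking again should show whether the cache picked it up.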

Ant, subant and basedir troubles

Posted by Kelvin on 11 Nov 2009 | Tagged as: programming

I just finished setting up a multi-project Ant build system and thought I'd blog about it. My build requirements were exactly what http://www.exubero.com/ant/dependencies.html described, so I followed the recommendations pretty much to the letter. However, it bombed in one area. One of my projects had a sub-folder where there were sub-projects and I called their […]

Average length of a URL

Posted by Kelvin on 06 Nov 2009 | Tagged as: Lucene / Solr / Elasticsearch / Nutch, crawling, programming

Aug 16 update: I ran a more comprehensive analysis with a more complete dataset. Find out the new figures for the average length of a URL. I've always been curious what the average length of a URL is, mostly when approximating memory requirements of storing URLs in RAM. Well, I did a dump of the […]

TokyoCabinet HDB slowdown

Posted by Kelvin on 10 Oct 2009 | Tagged as: work, programming

http://www.dmo.ca/blog/benchmarking-hash-databases-on-large-data/ reported that with a large number of records, puts become increasingly slow. I experienced a similar phenomenon, and just stumbled upon http://parand.com/say/index.php/2009/04/09/tokyo-cabinet-observations/, where I realized my problem was bnum being too small (default of 128k). According to the docs, bnum is the number of elements of the bucket array. If it is not more […]
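As a rough illustration of sizing bnum up front, here is a small sketch. The Tokyo Cabinet docs suggest a bucket array of roughly 0.5 to 4 times the number of records to be stored. The helper name suggest_bnum, the factor of 2, and the 50 million record count are all assumptions picked for the example, not figures from the post.

```shell
#!/bin/sh
# suggest_bnum is a hypothetical helper: given an expected record count,
# print a bucket array size. The factor of 2 is an assumption chosen from
# within the 0.5x to 4x range suggested by the Tokyo Cabinet docs.
suggest_bnum() {
    expected_records=$1
    echo $(( expected_records * 2 ))
}

# Example: planning for about 50 million records (an illustrative number).
bnum=$(suggest_bnum 50000000)
echo "suggested bnum: $bnum"

# Possible usage, assuming the tchmgr command-line tool is installed and
# "urls.tch" is a made-up file name:
#   tchmgr create urls.tch "$bnum"
```

The same value can be passed to the tune call of the hash database API before the database is opened for the first time.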

2GB limit with Tokyo Cabinet (aka invalid meta data)

Posted by Kelvin on 28 Sep 2009 | Tagged as: programming

Ran into a bunch of frustrating problems with Tokyo Cabinet recently. When my hash database was reaching 2GB in size, the datafile would become corrupt. What's scary is that at that stage, writes to the database seemingly disappeared into the void. When reopening this corrupt DB, I'd get an "invalid meta data" exception. I googled […]

LightVC – a simple and elegant PHP framework

Posted by Kelvin on 28 Sep 2009 | Tagged as: programming, PHP

Whilst working on a recent project involving clinical trials, I stumbled on LightVC, a PHP framework. Yes.. yet ANOTHER PHP framework. Its emphasis on simplicity and minimalism caught my eye and I decided to give it a whirl. 3 months later, I have to admit I'm a total fan. It makes the simple stuff easy, […]

Spatial searching multiple locations per Solr/Lucene doc

Posted by Kelvin on 09 Sep 2009 | Tagged as: programming

There are a number of solutions for geosearching/spatial search in Lucene and Solr.
- LocalLucene and LocalSolr are excellent options.
- LuceneTutorial.com describes a partial solution for doing it the old-school way using FieldCache and a custom Lucene query.
- The new TrieRange feature introduced by Uwe Schindler also offers a new way of performing […]

Idea: 2-stage recovery of corrupt Solr/Lucene indexes

Posted by Kelvin on 09 Sep 2009 | Tagged as: Lucene / Solr / Elasticsearch / Nutch, programming

I was recently onsite with a client who happened to have a corrupt Solr/Lucene index. The CheckIndex tool (Lucene 2.4+) diagnosed the problem, and gave the option of fixing it. Except… fixing the index in this case meant losing the corrupt segment, which also happened to be the one containing over 90% of documents. Because […]

Benchmarks for various approaches to serializing java objects

Posted by Kelvin on 10 Jul 2009 | Tagged as: programming

http://www.eishay.com/2009/03/more-on-benchmarking-java-serialization.html
