Supermind Search Consulting Blog 
Solr - Elasticsearch - Big Data

Posts about programming

100% height iframes

Posted by Kelvin on 30 Aug 2008 | Tagged as: programming

http://brondsema.net/blog/index.php/2007/06/06/100_height_iframe was a solution that worked for me after trying several out.

Using Hadoop IPC/RPC for distributed applications

Posted by Kelvin on 02 Jun 2008 | Tagged as: programming, Lucene / Solr / Elasticsearch / Nutch

Hadoop is growing to be a pretty large framework – release 0.17.0 has 483 classes!

Previously, I'd written about Hadoop SequenceFile. SequenceFile is part of the org.apache.hadoop.io package; the other notably useful classes in that package are ArrayFile and MapFile, which are persistent array and dictionary data structures respectively.
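As a rough sketch of what MapFile usage looks like (the directory path, key and value below are made up purely for illustration):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);

// write a sorted key/value map to the directory /tmp/people (keys must be appended in sorted order)
MapFile.Writer writer = new MapFile.Writer(conf, fs, "/tmp/people", Text.class, Text.class);
writer.append(new Text("kelvin"), new Text("consultant"));
writer.close();

// look the key back up
MapFile.Reader reader = new MapFile.Reader(fs, "/tmp/people", conf);
Text value = new Text();
reader.get(new Text("kelvin"), value);
reader.close();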

About Hadoop IPC

Here, I'm going to introduce Hadoop IPC, the Inter-Process Communication subsystem. Hadoop IPC is the underlying mechanism used whenever out-of-process applications need to communicate with one another.

Hadoop IPC
1. uses binary serialization via java.io.DataOutputStream and java.io.DataInputStream, unlike SOAP or XML-RPC.
2. is a simple, low-fuss RPC mechanism.
3. is unicast and does not support multicast.

Why use Hadoop IPC over RMI or Java serialization? Here's what Doug Cutting has to say:

Why didn't I use Serialization when we first started Hadoop? Because it looked big-and-hairy and I thought we needed something lean-and-mean, where we had precise control over exactly how objects are written and read, since that is central to Hadoop. With Serialization you can get some control, but you have to fight for it.

The logic for not using RMI was similar. Effective, high-performance inter-process communications are critical to Hadoop. I felt like we'd need to precisely control how things like connections, timeouts and buffers are handled, and RMI gives you little control over those.

Code!!!

I'm going to jump straight into some code examples to illustrate how easy Hadoop IPC can be.

Essentially, all unicast RPC calls involve a client and a server.

To create a server,

Configuration conf = new Configuration();
Server server = RPC.getServer(this, "localhost", 16000, conf);  // start a server on localhost:16000
server.start();

To create a client,

Configuration conf = new Configuration();
InetSocketAddress addr = new InetSocketAddress("localhost", 16000);  // the server's inetsocketaddress
ClientProtocol client = (ClientProtocol) RPC.waitForProxy(ClientProtocol.class, 
    ClientProtocol.versionID, addr, conf);

In this example, ClientProtocol is an interface implemented by the server class. ClientProtocol.java looks like this:

interface ClientProtocol extends org.apache.hadoop.ipc.VersionedProtocol {
  public static final long versionID = 1L;

  HeartbeatResponse heartbeat();
}

ClientProtocol defines a single method, heartbeat(), which returns a HeartbeatResponse object. heartbeat() is the remote method the client calls periodically to let the server know that it (the client) is still alive. The server then returns a HeartbeatResponse object which provides the client with some information.
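On the client side, invoking the remote method is then just an ordinary-looking method call on the proxy obtained earlier. A sketch (the println is mine, and reading the status field directly assumes the caller lives in the same package):

HeartbeatResponse response = client.heartbeat();  // marshalled over the wire by Hadoop IPC
System.out.println("server status: " + response.status);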

Here's what HeartbeatResponse.java looks like:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.UTF8;

public class HeartbeatResponse implements org.apache.hadoop.io.Writable {
  String status;

  public void write(DataOutput out) throws IOException {
    UTF8.writeString(out, status);
  }

  public void readFields(DataInput in) throws IOException {
    this.status = UTF8.readString(in);
  }
}

Summary

So, to summarize, you'll need:
1. A server which implements an interface (which itself extends VersionedProtocol).
2. One or more clients which call the interface methods.
3. Method arguments and return values which implement org.apache.hadoop.io.Writable.
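Putting the pieces together, here's a minimal sketch of what the server class could look like. The class name, the "OK" status value and the main() wiring are mine purely for illustration, and it assumes the class lives in the same package as ClientProtocol and HeartbeatResponse; only the RPC and VersionedProtocol calls come from the snippets above.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.ipc.RPC;
import org.apache.hadoop.ipc.Server;

public class HeartbeatServer implements ClientProtocol {

  // required by VersionedProtocol: lets the client verify it speaks the same protocol version
  public long getProtocolVersion(String protocol, long clientVersion) throws IOException {
    return versionID;
  }

  // the remote method invoked by clients over Hadoop IPC
  public HeartbeatResponse heartbeat() {
    HeartbeatResponse response = new HeartbeatResponse();
    response.status = "OK";
    return response;
  }

  public static void main(String[] args) throws IOException, InterruptedException {
    Configuration conf = new Configuration();
    Server server = RPC.getServer(new HeartbeatServer(), "localhost", 16000, conf);
    server.start();
    server.join();  // block until the server is stopped
  }
}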

TREC 2007 Million Queries Track

Posted by Kelvin on 10 May 2008 | Tagged as: programming

Just read about the IBM Haifa Team's experiences in tweaking Lucene relevance for TREC.

via Jeff's Search Engine Caffè

Lucene Tutorial.com

Posted by Kelvin on 25 Apr 2008 | Tagged as: programming

I've been maintaining a website dedicated to introducing Lucene to beginners.

Check it out here: http://www.lucenetutorial.com

Feedback is always welcome, including topics you'd like to see covered.

A Collection of JVM Options

Posted by Kelvin on 24 Apr 2008 | Tagged as: programming

Just found this collection of JVM options which might prove handy one day.

Limiting system cache size in Windows Server 2003

Posted by Kelvin on 24 Apr 2008 | Tagged as: programming

On a consulting gig, I was recently asked to investigate a strange problem with a Lucene server on Windows Server 2003.

The Lucene index was periodically refreshed by running a new instance of the app, then killing the old one via "taskkill". Worked fine, except the available memory displayed by Task Manager somehow steadily decreased with every app refresh, and it would run out of memory every 2-3 days. However, just by killing _all_ java processes, the memory would be magically reclaimed and available memory would immediately jump up to the correct amount.

Well, it turns out that the culprit was Windows' file system cache and the default Windows Server 2003 settings, which give priority to the system cache over application processes. So we ended up with a huge file system cache and not enough memory to start a new application process.

Here are some links which were helpful in troubleshooting this problem, and its eventual solution.

Initially I was trying to find where the missing memory went, since the "Available physical memory" value didn't match process totals given to me by the "Processes" tab.
The article "Everything you always wanted to know about Task Manager but were afraid to ask" helped me determine that "Commit Charge" was what I was really interested in, and that value did indeed match the process totals. So it wasn't some memory leak then.

After taking another look at Task Manager, I realized the biggest culprit was System Cache.

http://smallvoid.com/article/winnt-system-cache.html gives an overview of the Windows system cache and, in particular, lists tools for limiting its size.

Both CacheSet and SetSystemFileCacheSize worked in setting an upper limit on the file cache size, and that solved our problem of the missing memory.

Is Nutch appropriate for aggregation-type vertical search?

Posted by Kelvin on 24 Sep 2007 | Tagged as: programming, Lucene / Solr / Elasticsearch / Nutch, crawling

I get pinged all the time by people who tell me they want to build a vertical search engine with Nutch. The part I can't figure out, though, is why Nutch?

What's vertical anyway?

So let's start from basics. Vertical search engines typically fall into 2 categories:

  1. Whole-web search engines which selectively crawl the Internet for webpages related to a certain topic/industry/etc.
  2. Aggregation-type search engines which mine other websites and databases, aggregating data and repackaging it into a format which is easier to search.

Now, imagine a biotech company comes to me to develop a search engine for everything related to biotechnology and genetics. You'd have to crawl as many websites as you can, and only include the ones related to biotechnology in the search index.

How would I implement the crawler? Probably use Nutch for the crawling and modify it to only extract links from a page if the page contents are relevant to biotechnology. I'd probably need to write some kind of relevancy scoring function which uses a mixture of keywords, ontology and some kind of similarity detection based on sites we know a priori to be relevant.

Now, the second scenario. Imagine someone comes to me and wants to develop a job search engine for a certain country. This would involve indexing all jobs posted on the 4 major job websites, refreshing this database on a daily basis, checking for new jobs, deleting expired jobs, etc.

How would I implement this second crawler? Use Nutch? No way! Ahhhh, now we're getting to the crux of this post..

The ubiquity of Lucene … and therefore Nutch

Nutch is one of two open-source Java crawlers out there, the other being Heritrix from the good guys at the Internet Archive. It has ridden on Lucene's status as the default choice of full-text search API. Everyone who wants to build a vertical search engine in Java these days knows they're going to use Lucene as the search API, and naturally looks to Nutch for the crawling side of things. And that's when their project runs into a brick wall…

To Nutch or not to Nutch

Nutch (and Hadoop) is a very, very cool project with ambitious and praiseworthy goals. They're really trying to build an open-source version of Google (though I'm not sure if that is its explicitly declared aim).

Before jumping into any library or framework, you want to be sure you know what needs to be accomplished. I think this is the step many people skip: they have no idea what crawling is all about, so they try to learn what crawling is by observing what a crawler does. Enter Nutch.

The trouble is, observing/using Nutch isn't necessarily the best way to learn about crawling. The best way to learn about crawling is to build a simple crawler.

In fact, if you sit down and think about what a 4-job-site crawler really needs to do, it's not difficult to see that its functionality is modest and humble – in fact, I can write its algorithm out here:


for each site:
  if there is a way to list all jobs in the site, then
    page through this list, extracting job detail urls to the detail url database
  else if exists browseable categories like industry or geographical location, then
    page through these categories, extracting job detail urls to the detail url database
  else 
    continue
  for each url in the detail url database:
    download the url
    extract data into a database table according to predefined regex patterns

It won't be difficult to hack up something quick to do this, especially with the help of Commons HttpClient (see the sketch below). You'll probably also want to make this app multi-threaded.
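To give a feel for the download step, here's a minimal sketch using Commons HttpClient 3.x; the class and method names are mine, and the error handling is deliberately simplistic:

import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.HttpStatus;
import org.apache.commons.httpclient.methods.GetMethod;

public class SimpleFetcher {

  // downloads a detail url and returns its HTML, or null on a non-200 response
  public static String fetch(String url) throws Exception {
    HttpClient client = new HttpClient();
    GetMethod get = new GetMethod(url);
    try {
      int status = client.executeMethod(get);
      if (status != HttpStatus.SC_OK) {
        return null;
      }
      return get.getResponseBodyAsString();  // raw HTML, ready for regex-based extraction
    } finally {
      get.releaseConnection();
    }
  }
}

For the multi-threaded version, you'd typically share a single HttpClient backed by a MultiThreadedHttpConnectionManager across the worker threads.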

Other things you'll want to consider are how many simultaneous threads to hit a server with, whether you want to save the HTML content of pages or just keep the extracted data, how to deal with errors, etc.

All in all, I think you'll find that it's not altogether overwhelming, and there's actually a lot to be said for the complete control you have over the crawling and post-crawl extraction processes. Compare this to Nutch, where you'll need to fiddle with various configuration files (nutch-site.xml, urlfilters, etc.), where driving the crawl from an API perspective is difficult, where you'll have to work with the various file I/O structures to reach the content (SegmentFile, MapFile etc.), where various issues may prevent all urls from being fetched (retry.max being a common one), and where, if you want custom crawl logic, you'll have to patch/fork the codebase (ugh!).

The other thing that Nutch offers is an out-of-the-box search solution, but I personally have never found a compelling reason to use it – it's difficult to add custom fields, adding OR phrase capability requires patching the codebase, etc. In fact, I find it much, much simpler to come up with my own SearchServlet.

Even if you decide not to come up with a homegrown solution and want to go with Nutch, well, here's one other thing you need to know before jumping in.

To map-reduce, or not?

From Nutch 0.7 to Nutch 0.8, there was a pretty big jump in code complexity with the inclusion of the map-reduce infrastructure. Map-reduce subsequently got factored out, together with some of the core distributed I/O classes, into Hadoop.

For a simple example to illustrate my point, just take a look at the core crawler class, org.apache.nutch.fetcher.Fetcher, from the 0.7 branch, to the current 0.9 branch.

The 0.7 Fetcher is simple and easy to understand. I can't say the same of the 0.9 Fetcher. Even after having worked a bit with the 0.9 fetcher and map-reduce, I still find myself having to do mental gymnastics to figure out what's going on. BUT THAT'S OK, because writing massively distributable, scalable yet reliable applications is very, very hard, and map-reduce makes this possible and comparatively easy. The question to ask, though, is: does your search engine project to crawl and search those 4 job sites fall into this category? If not, you'd want to seriously consider against using the latest 0.8x release of Nutch and tend toward 0.7 instead. Of course, the biggest problem with this is that 0.7 is not being actively maintained (to my knowledge).

Conclusion

Perhaps someone will read this post and think I'm slighting Nutch, so let me make this really clear: _for what it's designed to do_, that is, whole-web crawling, Nutch does a good job of it; if what is needed is to page through search result pages and extract data into a database, Nutch is simply overkill.

Fuzzy string matching

Posted by Kelvin on 03 Jan 2007 | Tagged as: programming

I've recently been peripherally involved in a project which attempts to perform fuzzy matching on names in a MySQL database. With "Homethinking":http://www.homethinking.com, we had to do something similar for matching realtor and brokerage names. It's also related to some of the Lucene consulting I've been involved with.

It's an interesting problem. There's an article on "Wikipedia":http://en.wikipedia.org/wiki/Approximate_string_matching about it. "Project Dedupe":http://tickett.net/dedupe/index.php/Main_Page also has some "interesting links":http://tickett.net/dedupe/index.php/Existing_software worth checking out.

The first (simple) approaches that come to mind are:
* Edit/Levenshtein distance
* Phonetic matching, e.g. soundex, metaphone, double metaphone (though these are much less useful for names)
* Regex based wildcard matching

One way of going with the regex approach is to construct multiple regex patterns, adding a wildcard before each letter. Let me illustrate:
Given the word *foo*, the following regex patterns are created:
* /.*foo/
* /f.*oo/
* /fo.*o/
* /foo.*/

Keeping in mind that we're trying to match name misspellings, which generally consist of single or at most 2-character variations (sometimes additional, sometimes missing characters), it's not hard to see how these regexes will probably turn up the correctly spelled word.
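Here's a quick sketch of generating those patterns in Java (the method name is mine, just for illustration):

import java.util.ArrayList;
import java.util.List;

// for "foo" this produces .*foo, f.*oo, fo.*o and foo.*
public static List<String> wildcardPatterns(String word) {
  List<String> patterns = new ArrayList<String>();
  for (int i = 0; i <= word.length(); i++) {
    patterns.add(word.substring(0, i) + ".*" + word.substring(i));
  }
  return patterns;
}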

Another variation on the regex approach I was toying around with was generating the following regex, given the same word *foo*:
* /[foo]{3,}/

That is, look for all (or, if the word were longer, some) of the characters in the word, which takes into account any possible letter transpositions.

Ideally, for fuzzy matches, we'd want to be able to give a guesstimate score of how close we think a match is to the word.

On this project, we didn't need that, and were also constrained by needing to perform queries against MySQL (and not wanting to write/install any MySQL-specific code/functions). MySQL (v5+ I think) does come with a native soundex function which we'll probably end up using.

Of course, IMHO, the smart way to do it is to index the fields using Lucene; then you can do n-gram and edit-distance matching. But once again, we were constrained on this project.
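Purely for illustration, an edit-distance-style lookup against a Lucene index might look something like this (assuming a Lucene 2.x-era API; the index path, field name and similarity threshold are placeholders):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.FuzzyQuery;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;

IndexSearcher searcher = new IndexSearcher("/path/to/name-index");
// match names within a reasonable edit distance of the (misspelled) input
FuzzyQuery query = new FuzzyQuery(new Term("name", "jon smiht"), 0.6f);
Hits hits = searcher.search(query);
for (int i = 0; i < hits.length(); i++) {
  System.out.println(hits.doc(i).get("name") + " : " + hits.score(i));
}
searcher.close();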

MySQL Falcon open-sourced

Posted by Kelvin on 02 Jan 2007 | Tagged as: programming

Just read that "MySQL Falcon storage engine":http://www.mysql.org/doc/refman/5.1/en/se-falcon.html has been "open-sourced":http://it.slashdot.org/article.pl?sid=07/01/02/209227.

http://mike.kruckenberg.com/archives/2006/04/jim_starkey_int.html has a really good, concise brief on Falcon and what it does.

Ehud on forces driving commercial programming language creation

Posted by Kelvin on 01 Jan 2007 | Tagged as: programming

Just discovered this gem on LtU:

Ehud Lamm – Re: Growing a Language
5/25/2004; 3:02:35 PM
(in response to a previous post: I'll now turn the tables back: is it possible that the "commercial" languages (VB, Java, C#) gained popularity because they were created to provide programmers what they wanted rather than what someone thought they needed?)

I don't think this is the real cause. I know it is a popular explanation (cf. such terms as "bondage & discipline languages") but I think the dynamics are different.

Take .Net as an example (or C & Unix, PL/I & MVS etc.). People wanting to exploit .Net and future OS services from Microsoft are likely to choose either C# or VB.Net. Why? Because Microsoft decided to invest a lot of effort making these the natural choices. So you might say "someone thought the programmers need C#". The other perspective, i.e. that users use C# because it gives them what they want, is equally valid. It's just that C# does this because someone wanted it to…

A slightly more subtle question is why Microsoft (and others before it) decided to create their own language in this case rather than continue with the previous strategy of providing libraries for C++ or another mainstream language. I think the answer to this question is more enlightening.

One common explanation is that this is simply a way to crush Java. I don't think this is the full story.

Another reason, specific to this case, would be the move to managed code. This is an important factor in the move to CLR and associated languages. But, again, I don't think this is the full story.

Providing a language gives us a very powerful method to influence software construction, control programmer behaviour, and help produce the kinds of software we are interested in (via APIs, compile time analysis and run time support). Likewise, whenever an OS or even an application provider provides a programming language with added support, this will influence programmer behaviour (think VBA; think Javascript; think Guile). There is, yet again, positive feedback.

So it's not just giving programmers what you think they need, it's really about providing some kind of value. But that does not mean that any language that provides useful functionality, APIs etc. is the best tool for each and every software project. It is thus important to distinguish between general purpose languages, and languages used in more restricted domains.

Obviously the dynamics I described above aren't the only mechanism involved, but it's one of them.

There's no doubt in my mind that the languages we use (spoken, written or programming) dramatically sculpt both the nature and limits of our thoughts and ideas.

Perhaps even more significantly, it shapes the consciousness of the speakers and the listeners. In terms of programming, that would be the programmers and the end-users, respectively. Spend a day listening to French and another listening to German, then compare how you feel at the end of each day; spend a day working on Windows, and another on Linux, then compare how you feel at the end of each day. The difference is both striking and appalling. There is a clear trickle-down effect starting from the designer(s) of a language to the end-user, and correspondingly, to societies, countries and the world (especially for an operating system).

Another way of looking at this is that the designers of the programming languages that are, at any point in time, popularly used, are indirectly (and in all probability, unconsciously) influencing the consciousness of the world. That's a lot of power for folks who might not really be aware of the implications of their choices.

Something to think about.
