Supermind Search Consulting Blog 
Solr - Elasticsearch - Big Data

Posts about work

Basis – Infrastructure for Java projects

Posted by Kelvin on 04 Aug 2005 | Tagged as: work, programming

Basis is an open-source project providing commonly-required components such as user management, login/authentication, role-based access management/authorization and Hibernate support classes.

The data models are derived from The Data Model Resource Book by Len Silverston and adapted for Hibernate.

Unlike other open-source projects, Basis was never meant to be used as-is, but rather as a starting point for your individual project needs.

Current version: 0.1-alpha
Notes: 0.1-alpha is complete, but not release-quality. Still, do take it for a trial run.

Downloads
Download the latest version basis-0.1-alpha.zip
View javadoc documentation

License
Basis is released under the BSD license (includes software from the Apache foundation).

Credits
Big thanks to Gesakon GmbH for sponsoring development of Basis, and generously allowing it to be open-sourced.

TODO:
There is a fair amount of code providing a web-based UI using WebWork/Velocity which is not ready for release yet.
Test cases exist, but they're pretty outdated!

Mozilla Archive Format

Posted by Kelvin on 20 Jul 2005 | Tagged as: work

A Firefox extension that saves complete web pages (including images!) to a single archive file (.maf), or even to an MHTML-compatible format which can supposedly be opened in Internet Explorer.

Two thumbs up!

http://extensionroom.mozdev.org/more-info/maf

Improving Nutch for constrained crawls

Posted by Kelvin on 18 Jul 2005 | Tagged as: Lucene / Solr / Elasticsearch / Nutch, work, programming

A heads-up to anyone who is using Nutch for anything but whole-web crawling.

I'm about 70% through patching Nutch to be more usable for constrained crawling. It is basically a partial re-write of the whole fetching mechanism, borrowing large chunks of code here and there.

Features include:
– Customizable seed inputs, i.e. seed a crawl from a file, database, Nutch FetchList, etc
– Customizable crawl scopes, e.g. crawl the seed URLs and only the URLs within their domains (this can already be accomplished manually with RegexURLFilter, but what if there are 200,000 seed URLs?), or crawl the seed URL domains + 1 external link (not possible with the current filter mechanism)
– Online fetchlist building (as opposed to Nutch's offline method), and customizable strategies for building a fetchlist. The default implementation gives priority to hosts with a larger number of pages to crawl. Note that offline fetchlist building is ok too.
– Customizable fetch output mechanisms, like output to file, to segment, or even not at all (if we're just implementing a link-checker, for example)
– Fully utilizes HTTP 1.1 features (see below)
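
To illustrate the crawl-scope idea from the list above, here is a minimal sketch of a scope check that stays cheap even with 200,000 seed URLs. It is not code from the patch: the class name SeedHostScope and its methods are hypothetical, and it assumes "within their domains" means an exact host match (no subdomain handling). The seed hosts go into a hash set once, so each candidate URL costs a single lookup instead of a pass over thousands of regular expressions.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical crawl scope: accept a URL only if its host is one of the seed hosts. */
final class SeedHostScope {
    private final Set<String> seedHosts = new HashSet<>();

    /** Collects the host of every well-formed seed URL; malformed seeds are skipped. */
    SeedHostScope(List<String> seedUrls) {
        for (String seed : seedUrls) {
            String host = hostOf(seed);
            if (host != null) {
                seedHosts.add(host);
            }
        }
    }

    /** Returns true if the URL's host exactly matches the host of any seed URL. */
    boolean inScope(String url) {
        String host = hostOf(url);
        return host != null && seedHosts.contains(host);
    }

    /** Extracts the lower-cased host, or null if the URL is malformed or has no host. */
    static String hostOf(String url) {
        try {
            String host = new URI(url).getHost();
            return host == null ? null : host.toLowerCase(Locale.ROOT);
        } catch (URISyntaxException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        SeedHostScope scope = new SeedHostScope(Arrays.asList(
                "http://example.com/", "http://www.example.org/start.html"));
        System.out.println(scope.inScope("http://example.com/a/b.html"));
        System.out.println(scope.inScope("http://other.example.net/"));
    }
}
```

A "seed domains + 1 external link" scope would additionally need to know how a URL was discovered (for example, the depth outside the seed hosts), which a stateless per-URL filter like this one cannot express.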

My aim is to be fully compatible with Nutch as it is, i.e. given a Nutch fetchlist, the new crawler can produce a Nutch segment. However, if you don't need that at all, and are just interested in Nutch as a crawler, then that's ok too!

Nutch and HTTP 1.1
Nutch, as it is, does not use HTTP 1.1's connection persistence and request pipelining features, which would significantly cut down crawling time when crawling a small number of hosts extensively.

To test this hypothesis, I modified Http and HttpResponse to use HTTP 1.1 features, and fetched 20 URLs from the same host repeatedly (up to 15 requests were pipelined per socket connection, which required a total of 2 separate connections to the server). I repeated this test with Nutch's Http (protocol-http). After a couple of runs, I was convinced that, at least in my test case, the optimized Http class was 3-4x faster than Nutch's Http version.

I believe the reasons for this speed-up are:
1. Reduced connection overhead (2 socket connections vs 20)
2. Reduced wait-time between requests (once vs 19 times). This is the fetcher.server.delay value in NutchConf.
3. Request pipelining (not waiting for a request to produce a response before sending the next request)
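
The arithmetic behind the three points above can be sketched as follows. This is not the modified Http class: PipelineSketch and its methods are hypothetical names, and the code only builds the request text that would be written to one socket, without opening any connection. It shows how 20 requests at 15 per connection come out to 2 connections, and what "sending the next request without waiting for a response" looks like on the wire.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper illustrating HTTP/1.1 persistent connections with pipelining. */
final class PipelineSketch {

    /** Number of socket connections needed when up to maxPerConnection requests share one. */
    static int connectionsNeeded(int requests, int maxPerConnection) {
        return (requests + maxPerConnection - 1) / maxPerConnection; // ceiling division
    }

    /** Builds the text written to ONE socket: all requests are queued before any response is read. */
    static String pipelinedRequests(String host, List<String> paths) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < paths.size(); i++) {
            boolean last = (i == paths.size() - 1);
            out.append("GET ").append(paths.get(i)).append(" HTTP/1.1\r\n");
            out.append("Host: ").append(host).append("\r\n");
            // HTTP/1.1 connections persist by default; only the final request asks to close.
            if (last) {
                out.append("Connection: close\r\n");
            }
            out.append("\r\n");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(connectionsNeeded(20, 15) + " connections instead of 20");
        System.out.print(pipelinedRequests("example.com", Arrays.asList("/a.html", "/b.html")));
    }
}
```

With one connection per request, the per-host delay falls between every pair of requests (19 waits for 20 URLs); with 2 pipelined connections it falls only between the connections (1 wait).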

Note that this is a contrived example, of course, but not an improbable one when crawling large domains with a large number of fetcher threads.

I've still a pretty long way to go in terms of testing, but I think a good portion of the difficult bits have already been ironed out. When I'm satisfied that it's stable, I'll release a version for download and feedback.

Update (16 Aug 2005)
You might be interested in Reflections on modifying the Nutch crawler

Next and previous tabs in Firefox using Keyconfig

Posted by Kelvin on 01 Jul 2005 | Tagged as: work

Next tab:
getBrowser().mTabBox._tabs.advanceSelectedTab(1);

Prev tab:
getBrowser().mTabBox._tabs.advanceSelectedTab(-1);

Tuning Lucene

Posted by Kelvin on 22 Mar 2005 | Tagged as: work, Lucene / Solr / Elasticsearch / Nutch

Recommended values for mergeFactor, minMergeDocs, maxMergeDocs
http://marc.theaimsgroup.com/?l=lucene-user&m=110235452004799&w=2

Best Practices for Distributing Lucene Indexing and Searching
http://marc.theaimsgroup.com/?l=lucene-user&m=110973989200204&w=2

Optimizing for long queries
http://marc.theaimsgroup.com/?l=lucene-user&m=108840979004587&w=2
http://marc.theaimsgroup.com/?l=lucene-user&m=110969921306938&w=2

ParallelMultiSearcher Vs. One big Index
http://marc.theaimsgroup.com/?l=lucene-user&m=110607674000954&w=2

Lucene Performance Bottlenecks
http://marc.theaimsgroup.com/?l=lucene-user&m=113352444722705&w=2

Thoughts about Nutch

Posted by Kelvin on 10 Mar 2005 | Tagged as: work, Lucene / Solr / Elasticsearch / Nutch

I've been working on Nutch lately for a client, and it's good fun feeling my way around such an ambitious project. It's still rather immature – the code is stable, and there are no major bugs, but the API isn't yet developer-friendly, in that it's difficult to extend many classes without patching Nutch directly.

It's interesting to see Doug Cutting put Lucene through its paces in Nutch. It gives an indication of how Lucene can be made to do some interesting stuff. I think Nutch is the best available case study for how to power-use Lucene, and do stuff like distributed indexing and searching.

I would love to see

  1. the crawling part of Nutch extracted into a separate lib (I made a request on the mailing list for it, but got no response)
  2. easier-to-use console apps for manipulating the webdb
  3. …TBD

Vertical searching

Posted by Kelvin on 24 Feb 2005 | Tagged as: work

Whilst there may be a large opportunity to capitalize on the consumer search experience through vertical search "portals", is there a real business model for an industry where Google effectively monopolizes the market, and in the blink of an eye, can overwhelm existing vertical players by deciding to enter the vertical search fray? Just think Microsoft, minus the "we try not to piss too many companies off so they will still continue building applications on Windows".

Well, ok. If companies attempt to emulate Google, but in an industry/vertical-specific way, they certainly are vulnerable to inroads by Google, but if

  • a company can gain significant mindshare as market leader
  • the vertical lends itself to communities and/or requires significant effort which cannot be easily emulated without diluting Google's focus of being a search company

then maybe there is a lot of money to be made.

Still, I'd like to pull your attention back to the portal heyday (think Yahoo!, Altavista, MSN). If you were around in those days, you'll remember how much hype there was around "vortals", or vertical portals. Well, what's happened to them? Are they still around, just that the terms vortals and portals have fallen out of favour with VCs?

Effective delegation

Posted by Kelvin on 28 Jan 2005 | Tagged as: work

http://www.getmoredone.com/tips9.html and http://www.bignoseduglyguy.com/bnugwiki/HowToDelegateEffectively have some tips on delegating.

That's something I've always had problems with, and the info here should come in handy sometime.

Another take on OfBiz

Posted by Kelvin on 27 Jan 2005 | Tagged as: work

Seems like someone else is equally appalled by ofBiz's complexity.

http://www.theserverside.com/tss?service=direct/0/NewsThread/threadViewer.markNoisy.link&sp=l25373&sp=l119009#118827

Some months ago I was looking at the code of open for business: while it's architecture is pretty clean, it was a mess of freemarker + JSP + jpublish mvc + pages bound to page classes with an xml file per page + beanshell actions + 2 different xml scripting "micro"-languages (average length of line: 120 chars) which "speed up development of business logic" + services configured in xml files + java for the core + ofbiz entity engine for persistence + XPDL-defined workflows + business rules with their own language.

Prototyping/wireframing

Posted by Kelvin on 09 Jan 2005 | Tagged as: work

As I'm finishing up the persona goals and moving to designing the wireframe prototype of PracticalKM, I stumbled upon a couple of great resources for prototyping with Visio. Paper prototyping is alright, but I don't think it scales…

Visio – the interaction designer's nail gun
http://www.stcsig.org/usability/newsletter/0007-prototypingvisio.html
A visual vocabulary for describing information architecture and interaction design
http://www.nickfinck.com/stencils.html (which contains more links to Visio stencils for prototyping)

http://www.dmxzone.com/ShowDetail.asp?NewsId=3991 contains good information on wireframing.

There's also a wireframing extension for Dreamweaver.
