java.net.URL synchronization bottleneck
Posted by Kelvin on 08 Dec 2009 at 02:40 pm | Tagged as: programming, crawling
This is interesting because I haven't found anything on Google about it.
There's a static Hashtable in java.net.URL (urlStreamHandlers) which gets accessed on every constructor call. Hashtable's methods are synchronized, so every thread constructing a URL contends for the same lock. When you're running a crawler with, say, 50 threads, that turns out to be a major bottleneck.
Of the 70 threads I had running, 48 were blocked in the java.net.URL constructor. I was using the URL class to resolve relative URLs to absolute ones.
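For reference, the kind of call that was blocking looks roughly like the snippet below. It's a minimal sketch, not the crawler's actual code: the class name UrlCtorResolve and the example.com URLs are made up for illustration. Both constructor calls go through the shared handler lookup described above.

```java
import java.net.MalformedURLException;
import java.net.URL;

class UrlCtorResolve {
    // Resolves a link found on a page against that page's URL using the
    // two-argument java.net.URL constructor. Every call constructs two URL
    // objects, so it hits the shared stream handler lookup twice.
    static String resolve(String base, String relative) {
        try {
            URL baseUrl = new URL(base);
            return new URL(baseUrl, relative).toString();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("bad URL: " + base + " + " + relative, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(resolve("http://example.com/a/b.html", "c.html"));
        System.out.println(resolve("http://example.com/a/b.html", "/x"));
    }
}
```

Run from one thread this is harmless; the trouble only shows up when dozens of crawler threads call it at the same time.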
Since I had previously written a URL parser to parse out the parts of a URL, I went ahead and implemented my own URL resolution function.
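Something along these lines is what I mean by a resolution function that never touches java.net.URL. This is a simplified sketch written for this post, not my actual parser: the class name SimpleUrlResolver and its helper methods are hypothetical, it assumes the base is a well-formed absolute URL of the form scheme://host/path, and it loosely follows the reference resolution steps in RFC 3986 without covering every edge case.

```java
import java.util.ArrayDeque;
import java.util.Deque;

class SimpleUrlResolver {

    // Resolves ref against the absolute URL base using only string operations.
    static String resolve(String base, String ref) {
        if (hasScheme(ref)) return ref;                 // already absolute
        int schemeEnd = base.indexOf("://");
        if (schemeEnd < 0) throw new IllegalArgumentException("base must be absolute: " + base);
        String scheme = base.substring(0, schemeEnd);
        int authStart = schemeEnd + 3;

        // Empty or fragment-only reference: same document, new fragment.
        if (ref.isEmpty() || ref.startsWith("#")) {
            int hash = base.indexOf('#');
            return (hash < 0 ? base : base.substring(0, hash)) + ref;
        }
        // Protocol-relative reference: reuse only the base scheme.
        if (ref.startsWith("//")) return scheme + ":" + ref;

        // Split the base (minus query and fragment) into authority and path.
        int cut = indexOfAny(base, "?#", authStart);
        String bare = cut < 0 ? base : base.substring(0, cut);
        int pathStart = bare.indexOf('/', authStart);
        String authority = pathStart < 0 ? bare.substring(authStart) : bare.substring(authStart, pathStart);
        String basePath = pathStart < 0 ? "/" : bare.substring(pathStart);
        String prefix = scheme + "://" + authority;

        // Query-only reference: keep the base path.
        if (ref.startsWith("?")) return prefix + basePath + ref;

        // Separate the reference path from its query/fragment suffix.
        int refCut = indexOfAny(ref, "?#", 0);
        String refPath = refCut < 0 ? ref : ref.substring(0, refCut);
        String suffix = refCut < 0 ? "" : ref.substring(refCut);

        String merged = refPath.startsWith("/")
                ? refPath
                : basePath.substring(0, basePath.lastIndexOf('/') + 1) + refPath;
        return prefix + removeDotSegments(merged) + suffix;
    }

    // Collapses "." and ".." segments in a path that starts with "/".
    static String removeDotSegments(String path) {
        String[] segs = path.split("/", -1);
        Deque<String> out = new ArrayDeque<>();
        boolean trailingSlash = false;
        for (int i = 1; i < segs.length; i++) {
            String s = segs[i];
            boolean last = (i == segs.length - 1);
            if (s.equals(".")) {
                trailingSlash = last;
            } else if (s.equals("..")) {
                out.pollLast();                         // no-op at the root
                trailingSlash = last;
            } else {
                out.addLast(s);
            }
        }
        String joined = "/" + String.join("/", out);
        return (trailingSlash && !out.isEmpty()) ? joined + "/" : joined;
    }

    // True if s starts with a URI scheme such as "http:" or "mailto:".
    static boolean hasScheme(String s) {
        int colon = s.indexOf(':');
        if (colon <= 0 || !Character.isLetter(s.charAt(0))) return false;
        for (int i = 1; i < colon; i++) {
            char c = s.charAt(i);
            if (!(Character.isLetterOrDigit(c) || c == '+' || c == '-' || c == '.')) return false;
        }
        return true;
    }

    // Index of the first character of s (from index 'from') found in chars, or -1.
    static int indexOfAny(String s, String chars, int from) {
        for (int i = from; i < s.length(); i++) {
            if (chars.indexOf(s.charAt(i)) >= 0) return i;
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(resolve("http://example.com/a/b.html", "../c.html"));
    }
}
```

The point of doing it this way is that there is no shared mutable state at all: every call works on its own strings and its own local deque, so threads never wait on each other.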
Went from
Status: 12.407448 pages/s, 207.06316 kb/s, 2136.143 bytes/page
to
Status: 43.9947 pages/s, 557.29156 kb/s, 1621.4071 bytes/page
after increasing the number of threads to 100 (which would not have made much difference with the java.net.URL implementation).
Cool stuff.