Nutch patch #1

September 6, 2004

At work I was told to investigate other options for a search engine that would search just the sites that we host. While I was doing that I came across Nutch. It looked pretty sweet, but it didn’t quite fit our current needs; we needed a few more features. Currently at work we’re looking at a Google Search Appliance. It costs a pretty penny, but it would be nice because hopefully we could just “set it and forget it.”

Lately in my spare time, I’ve started trying to add the features to Nutch that would allow us to use it. It’s fun. I recently submitted my first patch to the Nutch developers list. Hopefully I did everything well enough to get it committed to CVS. This patch allows users to specify Perl 5 regular expressions, which get applied to every URL that Nutch encounters. It’s useful for stuff like stripping session IDs out of URLs.
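To give a rough idea of what regex-based URL cleanup looks like, here is a minimal sketch in plain Java. It is not the patch itself and does not use any Nutch API: the class name `UrlRegexSketch`, the `Rule` helper, and the three rules are all made up for illustration, and it uses `java.util.regex` (whose syntax is largely Perl-like) rather than a dedicated Perl 5 regex library.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch: applies an ordered list of regex rules to a URL.
class UrlRegexSketch {

    // One rule = a compiled pattern plus its replacement string.
    static final class Rule {
        final Pattern pattern;
        final String replacement;

        Rule(String regex, String replacement) {
            this.pattern = Pattern.compile(regex);
            this.replacement = replacement;
        }
    }

    // Example rules only; a real setup would load these from configuration.
    private static final List<Rule> RULES = Arrays.asList(
        // Drop a ";jsessionid=..." path parameter.
        new Rule("(?i);jsessionid=[^?#]*", ""),
        // Drop a PHPSESSID or sid query parameter, keeping the "?" or "&" before it.
        new Rule("([?&])(?:PHPSESSID|sid)=[^&#]*&?", "$1"),
        // Remove a "?" or "&" left dangling at the end of the URL.
        new Rule("[?&]$", "")
    );

    // Runs every rule, in order, over the URL and returns the result.
    static String normalize(String url) {
        String result = url;
        for (Rule rule : RULES) {
            result = rule.pattern.matcher(result).replaceAll(rule.replacement);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(normalize("http://example.com/index.php?PHPSESSID=deadbeef&page=2"));
    }
}
```

The rules run in order, so a later rule can tidy up after an earlier one, which is what the last rule does with a leftover “?” or “&”.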

I’ve got a few more features that need to be added. I also found another drawback in the way the Nutch crawler was written. You can specify any number of threads to run at the same time. However, it currently won’t allow two different threads to download from the same IP simultaneously. That’s a problem for us, since all of our websites look like a single IP to the crawler. I’ll probably have to make some changes there. Hopefully it’ll be relatively straightforward.
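One way the change might look is a configurable cap on simultaneous downloads per host, instead of a hard limit of one. The sketch below is only an assumption about a possible approach, not how the Nutch fetcher actually works; `HostLimiterSketch` and its methods are hypothetical names.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;

// Hypothetical sketch: caps the number of concurrent fetches per host or IP.
class HostLimiterSketch {
    private final int maxPerHost;
    private final ConcurrentHashMap<String, Semaphore> slots = new ConcurrentHashMap<>();

    HostLimiterSketch(int maxPerHost) {
        this.maxPerHost = maxPerHost;
    }

    // Returns true if the calling thread may fetch from this host right now.
    // Does not block; a caller that gets false should pick another URL.
    boolean tryAcquire(String host) {
        return slots.computeIfAbsent(host, h -> new Semaphore(maxPerHost)).tryAcquire();
    }

    // Must be called exactly once after each successful tryAcquire for the host.
    void release(String host) {
        Semaphore slot = slots.get(host);
        if (slot != null) {
            slot.release();
        }
    }

    public static void main(String[] args) {
        HostLimiterSketch limiter = new HostLimiterSketch(2);
        System.out.println(limiter.tryAcquire("10.0.0.1")); // first fetch thread
        System.out.println(limiter.tryAcquire("10.0.0.1")); // second fetch thread
        System.out.println(limiter.tryAcquire("10.0.0.1")); // third is over the cap
    }
}
```

With `maxPerHost` set to 1 this behaves like the one-thread-per-IP restriction described above; raising it would let several threads hit the same IP at once, at the cost of being less polite to that server.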

Cool use of Nutch: Creative Commons Search (via: Doug)