<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>lukebaker.org &#187; Nutch</title>
	<atom:link href="http://lukebaker.org/archives/category/projects/nutch/feed/" rel="self" type="application/rss+xml" />
	<link>http://lukebaker.org</link>
	<description>lukebaker.org</description>
	<lastBuildDate>Tue, 08 Mar 2011 22:39:20 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>Nutch and Bittorrent</title>
		<link>http://lukebaker.org/archives/2005/05/25/nutch-and-bittorrent/</link>
		<comments>http://lukebaker.org/archives/2005/05/25/nutch-and-bittorrent/#comments</comments>
		<pubDate>Thu, 26 May 2005 00:31:11 +0000</pubDate>
		<dc:creator>Luke</dc:creator>
				<category><![CDATA[General]]></category>
		<category><![CDATA[Nutch]]></category>

		<guid isPermaLink="false">http://lukebaker.org/archives/2005/05/25/nutch-and-bittorrent/</guid>
		<description><![CDATA[Cool. The new bittorrent search engine uses Nutch.]]></description>
			<content:encoded><![CDATA[<p>Cool.  The new <a href="http://search.bittorrent.com/">bittorrent search engine</a> uses <a href="http://www.nutch.org/">Nutch.</a></p>
]]></content:encoded>
			<wfw:commentRss>http://lukebaker.org/archives/2005/05/25/nutch-and-bittorrent/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Nutch Short-Term Goals</title>
		<link>http://lukebaker.org/archives/2004/09/07/nutch-shorterm-goals/</link>
		<comments>http://lukebaker.org/archives/2004/09/07/nutch-shorterm-goals/#comments</comments>
		<pubDate>Tue, 07 Sep 2004 23:18:44 +0000</pubDate>
		<dc:creator>Luke</dc:creator>
				<category><![CDATA[General]]></category>
		<category><![CDATA[Nutch]]></category>
		<category><![CDATA[Projects]]></category>

		<guid isPermaLink="false">http://lukebaker.org/archives/2004/09/07/nutch-shorterm-goals/</guid>
		<description><![CDATA[Ability to use regular expressions for URL substitutions. Status: Committed to CVS. Allow users to search using url:Store/View/Product/1001 Status: Committed to CVS. Faster crawling of websites that look like one (1) IP address. Status: Committed to CVS. Some sort of templating engine for creating search results pages. Maybe use Velocity?]]></description>
			<content:encoded><![CDATA[<ol>
<li>Ability to use regular expressions for URL substitutions.
<ul>
<li>Status: <strong><em><a href="http://cvs.sourceforge.net/viewcvs.py/nutch/nutch/CHANGES.txt?r1=1.28&#038;r2=1.29">Committed to CVS.</a></em></strong></li>
</ul>
</li>
<li>Allow users to search using url:Store/View/Product/1001
<ul>
<li>Status: <em><strong><a href="http://sourceforge.net/mailarchive/forum.php?thread_id=5529889&#038;forum_id=12036">Committed to CVS.</a></strong></em></li>
</ul>
</li>
<li>Faster crawling of websites that look like one (1) IP address.
<ul>
<li>Status: <strong><em><a href="http://cvs.sourceforge.net/viewcvs.py/nutch/nutch/CHANGES.txt?r1=1.43&#038;r2=1.44">Committed to CVS.</a></em></strong></li>
</ul>
</li>
<li>Some sort of templating engine for creating search results pages.  Maybe use <a href="http://jakarta.apache.org/velocity/">Velocity</a>?</li>
</ol>
]]></content:encoded>
			<wfw:commentRss>http://lukebaker.org/archives/2004/09/07/nutch-shorterm-goals/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Nutch patch #1</title>
		<link>http://lukebaker.org/archives/2004/09/06/nutch-patch-1/</link>
		<comments>http://lukebaker.org/archives/2004/09/06/nutch-patch-1/#comments</comments>
		<pubDate>Mon, 06 Sep 2004 21:09:10 +0000</pubDate>
		<dc:creator>Luke</dc:creator>
				<category><![CDATA[General]]></category>
		<category><![CDATA[Nutch]]></category>
		<category><![CDATA[Work]]></category>

		<guid isPermaLink="false">http://lukebaker.org/archives/2004/09/06/nutch-patch-1/</guid>
		<description><![CDATA[At work I was told to investigate other options for a search engine that would search just the sites that we host. While I was doing that I came across Nutch. It looked pretty sweet but not quite something that would fit our current needs. We needed a few more features. Currently at work we&#8217;re [...]]]></description>
			<content:encoded><![CDATA[<p>At <a href="http://www.gospelcom.net/">work</a> I was told to investigate other options for a search engine that would search just the sites that we host.  While I was doing that I came across <a href="http://www.nutch.org/">Nutch.</a>  It looked pretty sweet, but it was not quite something that would fit our current needs.  We needed a few more features.  Currently at work we&#8217;re looking at a <a href="http://www.google.com/appliance/">Google Search Appliance.</a>  It costs a pretty penny, but it would be nice because hopefully we could just <a href="http://www.ronco.com/products/rotisserie_std.di4?productID=1">&#8220;set it and forget it.&#8221;</a></p>
<p>Lately in my spare time, I&#8217;ve started trying to add the features to Nutch that would allow us to use it.  It&#8217;s fun.  I recently <a href="http://sourceforge.net/mailarchive/forum.php?thread_id=5515493&#038;forum_id=13068">submitted</a> my first <a href="http://lukebaker.org/upload/RegexUrlNormalizer.patch">patch</a> to the Nutch developers list.  Hopefully I did everything well enough to get it committed to CVS.  This patch allows users to specify Perl 5 regular expressions, which will get applied to all URLs that Nutch encounters.  It&#8217;s useful for stuff like stripping out session IDs in URLs.</p>
<p>I&#8217;ve got a few more features that need to be added.  I found another drawback to the way the crawler for Nutch was written.  You can specify any number of threads to run at the same time.  However, it currently won&#8217;t allow two different threads to download from the same IP simultaneously.  This is a problem for us, since all of our websites look like a single IP to the crawler.  I&#8217;ll probably have to make some changes there.  Hopefully it&#8217;ll be relatively straightforward and easy.</p>
<p><em>Cool use of Nutch: <a href="http://creativecommons.org/weblog/entry/4388">Creative Commons Search</a> (via: <a href="http://www.nutch.org/blog/2004_09_01_cutting_archive.html">Doug</a>)</em></p>
]]></content:encoded>
			<wfw:commentRss>http://lukebaker.org/archives/2004/09/06/nutch-patch-1/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>


