lukebaker.org

Archive for the ‘Projects’ Category

Cookies and Contacts

with 5 comments

As everybody and their mom has been talking about Greasemonkey and AJAX, I decided I had better get a piece of the action. I also had a growing annoyance with the state of MUAs. I was contemplating writing my own; however, I’ve since realized that perhaps I can get what I want by using Greasemonkey to add features to Gmail. Persistent Persistent Searches is the first step.

The bit of code that I wrote allows people to write Greasemonkey scripts that store data in a Gmail contact. I wanted it to be relatively secure, “nice” to Gmail’s servers, and easy to use. What I ended up with is something that stores data in the note field of a particular contact. The data resembles cookie data and can be used in a very similar fashion.

In fact, the note data is cached in a cookie. This is my attempt to be nice to Gmail’s servers. On initial login, the note data is downloaded from the contact and stored in a cookie. Any read attempts for note data actually just read the local cookie. Any write attempts change the data both in the local cookie and on Gmail’s servers.

Since this data is cached in a cookie, I was a little concerned with the expiration time of this cookie (so that other users on the same computer couldn’t see the cookie data). The pseudo-solution was that this cache cookie expires after 3 minutes. However, Gmail checks for new mail every 2 minutes. Each time this happens (and the same user is logged in), the cache cookie is given another 3 minutes to live. In other words, this cache cookie will be around for at most 3 minutes after the user logs out of Gmail.
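The caching scheme described above is essentially a write-through cache with a sliding expiry. Here is a minimal sketch of that idea in Python rather than the Greasemonkey JavaScript the post is about. Everything named here is an assumption made for illustration: `ContactNoteCache`, the `fetch_note` and `save_note` callables (standing in for the round trips to the contact's note field), and the lazy reload after expiry (the post only mentions a download on initial login).

```python
import time
from typing import Callable, Dict, Optional


class ContactNoteCache:
    """Write-through cache with a sliding expiry (illustrative only)."""

    def __init__(
        self,
        fetch_note: Callable[[], Dict[str, str]],      # hypothetical: download the note data
        save_note: Callable[[Dict[str, str]], None],   # hypothetical: upload the note data
        ttl_seconds: float = 180.0,                    # the 3 minute lifetime from the post
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch_note = fetch_note
        self._save_note = save_note
        self._ttl = ttl_seconds
        self._clock = clock
        self._data: Optional[Dict[str, str]] = None
        self._expires_at = 0.0

    def _load(self) -> Dict[str, str]:
        # Assumption: re-download whenever the local copy is missing or expired.
        now = self._clock()
        if self._data is None or now >= self._expires_at:
            self._data = dict(self._fetch_note())
            self._expires_at = now + self._ttl
        return self._data

    def read(self, key: str) -> Optional[str]:
        # Reads are served from the local copy only.
        return self._load().get(key)

    def write(self, key: str, value: str) -> None:
        # Writes update the local copy and the remote note together.
        data = self._load()
        data[key] = value
        self._save_note(dict(data))

    def touch(self) -> None:
        # Called on each mail check: give a live cache another full lifetime.
        now = self._clock()
        if self._data is not None and now < self._expires_at:
            self._expires_at = now + self._ttl
```

With a mail check calling `touch()` every 2 minutes, the 3 minute lifetime never runs out while the user stays logged in. Once the checks stop, the cached copy lasts at most one more lifetime.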

Written by Luke

May 25th, 2005 at 2:24 pm

Posted in General,Gmail,Projects

Nutch Short-term Goals

with 2 comments

  1. Ability to use regular expressions for URL substitutions.
  2. Allow users to search using url:Store/View/Product/1001
  3. Faster crawling of websites that look like one (1) IP address.
  4. Some sort of templating engine for creating search results pages. Maybe use Velocity?

Written by Luke

September 7th, 2004 at 7:18 pm

Posted in General,Nutch,Projects

Nutch patch #1

without comments

At work I was told to investigate other options for a search engine that would search just the sites that we host. While I was doing that I came across Nutch. It looked pretty sweet, but not quite something that would fit our current needs. We needed a few more features. Currently at work we’re looking at a Google Search Appliance. It costs a pretty penny, but would be nice because hopefully we could just “set it and forget it.”

Lately in my spare time, I’ve started trying to add the features to Nutch that would allow us to use it. It’s fun. I recently submitted my first patch to the Nutch developers list. Hopefully I did everything well enough to get it committed to CVS. This patch allows users to specify Perl 5 regular expressions, which will get applied to all URLs that Nutch encounters. It’s useful for stuff like stripping out session IDs in URLs.
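To give a feel for what regex-based URL rewriting looks like, here is a small sketch using Python's `re` module. It is not the patch itself and not Nutch's configuration format. The rule list, the session parameter names (`jsessionid`, `PHPSESSID`, `sid`), and the `normalize_url` helper are all made up for illustration. Python's regex syntax is not Perl 5's, though these simple patterns read the same in both.

```python
import re
from typing import List, Pattern, Tuple

# Hypothetical substitution rules, applied in order to every URL.
RULES: List[Tuple[Pattern[str], str]] = [
    # Path-style session IDs: /store;jsessionid=ABC123?item=5
    (re.compile(r";jsessionid=[0-9A-Za-z]+", re.IGNORECASE), ""),
    # Query-style session IDs: keep the leading ? or & and drop the parameter.
    (re.compile(r"([?&])(?:PHPSESSID|sid)=[0-9A-Za-z]+&?", re.IGNORECASE), r"\1"),
    # Tidy up a dangling ? or & left behind by the rule above.
    (re.compile(r"[?&]$"), ""),
]


def normalize_url(url: str, rules: List[Tuple[Pattern[str], str]] = RULES) -> str:
    """Apply each (pattern, replacement) rule in turn and return the result."""
    for pattern, replacement in rules:
        url = pattern.sub(replacement, url)
    return url
```

Two URLs that differ only in their session ID now normalize to the same string, so a crawler would treat them as one page instead of fetching it again for every session.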

I’ve got a few more features that need to be added. I found another drawback to the way the crawler for Nutch was written. You can specify any number of threads to be running at the same time. However, currently it won’t allow two different threads to download from the same IP simultaneously. This is not good, considering all of our websites look like just 1 IP to the crawler. I’ll probably have to make some changes there. Hopefully it’ll be relatively straightforward and easy.
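The one-thread-per-IP restriction can be pictured as a limiter keyed by IP address. The sketch below is not how Nutch implements it. `PerHostLimiter` and its `max_per_host` parameter are hypothetical names, used only to show how a limit of 1 gives the behavior described above and how a larger limit would relax it.

```python
import threading
from typing import Dict


class PerHostLimiter:
    """Caps how many fetcher threads may hit the same IP at once (illustrative only)."""

    def __init__(self, max_per_host: int = 1) -> None:
        self._max = max_per_host
        self._guard = threading.Lock()  # protects the dictionary below
        self._slots: Dict[str, threading.BoundedSemaphore] = {}

    def _semaphore_for(self, host: str) -> threading.BoundedSemaphore:
        # Create the per-IP semaphore lazily, the first time the IP is seen.
        with self._guard:
            if host not in self._slots:
                self._slots[host] = threading.BoundedSemaphore(self._max)
            return self._slots[host]

    def try_acquire(self, host: str) -> bool:
        # Non-blocking: returns False when the IP is already at its limit.
        return self._semaphore_for(host).acquire(blocking=False)

    def release(self, host: str) -> None:
        # Hand the slot back once the download has finished.
        self._semaphore_for(host).release()
```

With `max_per_host=1`, a second thread asking for the same IP is turned away until the first one releases, which is painful when every site sits behind a single address. Raising the limit for known hosts is one possible way around that.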

Cool use of Nutch: Creative Commons Search (via: Doug)

Written by Luke

September 6th, 2004 at 5:09 pm

Posted in General,Nutch,Work