Back before Google, a lot of hackers were writing search engines in their free time. The general consensus, at least as I recall it, was that search was a problem that needed to be solved, and that all the existing solutions more or less sucked. Today, search encompasses a huge territory and there are still a lot of problems to be solved, but, for the most part, web search is extremely usable and reliable. It’s not perfect and there’s room for improvement, but it gets the job done. I don’t know too many people these days who spend their time hacking search. Why re-create such a low-level service when there are so many innovative and higher-level web applications to be built?

The thing is, search is the operating system of the web. The fact that we have no open-source/open-data search infrastructure is as bad as if there were no Linux or OpenBSD. If Google, Yahoo and Microsoft weren’t providing such a great product, my guess is that the hacker community would be attacking this problem like Captain Kirk on a lizard monster.

Where We Are:

Currently, there are a number of open source projects related to general web search. Most notably, the Java-based Lucene project is a solid foundation for indexing and information retrieval, and it’s what the Nutch search engine is built on.

There are a few distributed crawlers like Grub and Majestic 12. Unfortunately, these both pass data to a central, private storage system. The hard work of crawling and indexing is open for everyone to participate in, but the resulting data is not.

Where We Need To Be:

In my mind, search hackers need to create an open source solution for the following:

  • A distributed mechanism for crawling and indexing the web on a mass scale.
  • Distributed, decentralized, redundant data storage for the cache and index.
  • An end-user, public-facing interface for querying the distributed index.
  • A mechanism for retrieving or crawling a local, private slice of the index and cache, for research or personal use.
  • A way to publish alternate indexing models to the distributed grid.
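
To make the "distributed, decentralized, redundant" storage item a bit more concrete, here is a minimal sketch of one way peers could agree on which machines hold the copies of a given page without asking a central coordinator. It uses rendezvous (highest-random-weight) hashing, which is my choice for illustration and not something any of the projects above is known to use. The node names, the `replicas_for` helper and the default of three copies are all hypothetical.

```python
import hashlib
from typing import List


def score(node: str, key: str) -> int:
    """Deterministic pseudo-random weight for a (node, key) pair."""
    digest = hashlib.sha256(f"{node}|{key}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def replicas_for(key: str, nodes: List[str], copies: int = 3) -> List[str]:
    """Pick the `copies` highest-scoring nodes to store `key`.

    Every peer that knows the same node list computes the same answer,
    so no central directory is needed (an assumption of this sketch).
    """
    ranked = sorted(nodes, key=lambda n: score(n, key), reverse=True)
    return ranked[:copies]


# Hypothetical peers and a page URL used as the storage key.
nodes = ["node-a", "node-b", "node-c", "node-d", "node-e"]
owners = replicas_for("http://example.com/", nodes)
print(owners)
```

The useful property is that when a node that does not hold a given page drops out, that page's owners stay put, and when an owner drops out, only its copy has to move. A real system would still need peer discovery and re-replication, which this sketch ignores.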

All of these tools need to be designed with the assumption that anyone can and will have access to the system’s data, and that, as the system grows, there will be people, corporations, and governments hell-bent on corrupting the search infrastructure to their advantage.
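
As one small illustration of designing for hostile participants, the sketch below accepts a crawled page only when enough independent peers report the same content fingerprint. This is a simple quorum vote that I am assuming for the sake of example, not a complete defense: it does nothing against an attacker who controls many peers, and it ignores pages that legitimately differ between fetches. The peer names and the `accepted_fingerprint` helper are hypothetical.

```python
import hashlib
from collections import Counter
from typing import Dict, Optional


def fingerprint(content: bytes) -> str:
    """Content hash that peers report instead of the full page."""
    return hashlib.sha256(content).hexdigest()


def accepted_fingerprint(reports: Dict[str, str], quorum: int) -> Optional[str]:
    """Return the fingerprint reported by at least `quorum` peers, else None.

    `reports` maps a peer name to the fingerprint that peer claims to have seen.
    """
    if not reports:
        return None
    value, votes = Counter(reports.values()).most_common(1)[0]
    return value if votes >= quorum else None


# Two hypothetical honest peers and one that reports tampered content.
honest = fingerprint(b"<html>real page</html>")
reports = {
    "peer-1": honest,
    "peer-2": honest,
    "peer-3": fingerprint(b"<html>spam</html>"),
}
print(accepted_fingerprint(reports, quorum=2) == honest)
```

With a quorum of two the honest fingerprint wins; raise the quorum to three and nothing is accepted, which is the trade-off between tolerating bad peers and getting pages indexed at all.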

It’s not an easy problem to solve, but you’ve got to admit it’s an interesting problem. Anyone keen on being the Torvalds of search?

Where To Begin:

The Lucene Project – Link
Nutch Open Source Search Engine – Link
Open Source Search Wiki – Link

Have I missed anything? Please share your thoughts on open source search in the comments.