Back before Google, a lot of hackers were writing search engines in their free time. The general consensus, at least from my own recollection, was that search was a problem that needed to be solved, and that all the current solutions more or less sucked. Today, search encompasses a huge territory and there are still a lot of problems to be solved, but, for the most part, web search is extremely usable and reliable. It's not perfect and there's room for improvement, but it gets the job done. I don't know too many people these days who spend their time hacking search. Why re-create such a low-level service when there are so many innovative, higher-level web applications to be built?
The thing is, search is the operating system of the web. The fact that we have no open-source/open-data search infrastructure is as bad as if there were no Linux or OpenBSD. If Google, Yahoo and MS weren’t providing such a great product, my guess is that the hacker community would be attacking this problem like Captain Kirk on a lizard monster.
Where We Are:
Currently, there are a number of open source projects related to general web search. Most notably, the Java-based Lucene project is a solid foundation for indexing and information retrieval, and it's what the Nutch search engine is built on.
There are a few distributed crawlers like Grub and Majestic 12. Unfortunately, these both pass data to a central, private storage system. The hard work of crawling and indexing is open for everyone to participate in, but the resultant data is not.
Where We Need To Be:
In my mind, search hackers need to create an open source solution for the following:
- A distributed mechanism for crawling and indexing the web on a mass scale.
- Distributed, decentralized, redundant data storage for the cache and index.
- An end-user, public facing interface for querying the distributed index.
- A mechanism for retrieving or crawling a local, private slice of the index and cache, for research or personal use.
- A way to publish alternate indexing models to the distributed grid.
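To make the first two points concrete, one way a swarm of independent crawler nodes could divide up the web without a central coordinator is consistent hashing: each URL maps to a point on a hash ring, and the node owning the next point on the ring is responsible for crawling and indexing it. This is just a minimal sketch of that idea; the node names, replica count, and `CrawlRing` class are all hypothetical, not part of any existing project mentioned above.

```python
import hashlib
from bisect import bisect

class CrawlRing:
    """Hypothetical consistent-hash ring for partitioning crawl work.

    Each node is placed on the ring many times ("virtual nodes"), so
    when a node joins or leaves, only a small fraction of URLs change
    owners -- important if random volunteers come and go.
    """

    def __init__(self, nodes, replicas=64):
        self.ring = []  # sorted list of (hash_value, node_name)
        for node in nodes:
            for i in range(replicas):
                self.ring.append((self._hash(f"{node}:{i}"), node))
        self.ring.sort()

    @staticmethod
    def _hash(key):
        # SHA-1 gives a stable, well-distributed position on the ring
        return int(hashlib.sha1(key.encode()).hexdigest(), 16)

    def node_for(self, url):
        """Return the node responsible for crawling this URL."""
        h = self._hash(url)
        keys = [k for k, _ in self.ring]
        idx = bisect(keys, h) % len(self.ring)
        return self.ring[idx][1]

ring = CrawlRing(["node-a", "node-b", "node-c"])
owner = ring.node_for("http://example.com/page")
```

The same partitioning could decide which nodes hold which slices of the index and cache, with replication across several successive ring positions for redundancy.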
All of these tools need to be designed with the assumption that anyone can and will have access to the system’s data, and as the system grows, there will be people, corporations, and governments hell-bent on corrupting the search infrastructure to their advantage.
It’s not an easy problem to solve, but you’ve got to admit it’s an interesting problem. Anyone keen on being the Torvalds of search?
Where To Begin:
The Lucene Project – Link
Nutch Open Source Search Engine – Link
Open Source Search Wiki – Link
Have I missed anything? Please share your thoughts on open source search in the comments.
7 thoughts on “Where’s the open source distributed search?”
Not sure why the sample image got converted to a JPG, but anything other than photos look awful as JPGs. Compare the image above to the original at http://www.r-project.org/hpgraphic.png and see how muddy the compression artifacts make things look.
I'm a PhD student in computational biology and I program a LOT of R (pretty much exclusively). It is incredibly powerful because:
- it's extensible, even with C or FORTRAN code
- it can be used as a general-purpose scripting language (personally I find it much more intuitive than Python, for instance)
- it's geared toward handling real data, and advanced machine learning and modeling are pretty easy to do
- it's easy to create beautiful plots and graphics in very few lines of code
Right now as a hobby I’m working on an R package to interface with Arduino. I’ve already got basic serial communication between the board and R working well. My dream is to someday release the package so that R and Arduino (aRduino, anyone?) can be used together as a platform for open scientific instrumentation development.
The serial communication part is easy, but the real beauty would be in leveraging R’s powerful analysis tools in a real-time way, so that data collection and analysis can happen simultaneously. R also has GTK+ bindings, so it’s pretty easy to write user interfaces in R, which would be nice for instrument software development.
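The "collect and analyze simultaneously" part of the commenter's idea can be sketched with an online (streaming) statistic: analysis updates as each sample arrives, so nothing has to be buffered before plotting. This sketch is in Python for illustration rather than R, and the simulated readings stand in for a real serial stream (e.g. a pyserial connection to the board, not shown here); everything below is a hypothetical illustration, not the commenter's package.

```python
def running_stats(lines):
    """Yield (n, mean, variance) after each sample, using Welford's
    online algorithm, so stats stay current as data streams in."""
    n, mean, m2 = 0, 0.0, 0.0
    for line in lines:
        try:
            x = float(line.strip())
        except ValueError:
            continue  # skip malformed readings from the serial line
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
        yield n, mean, (m2 / (n - 1) if n > 1 else 0.0)

# Simulated readings instead of a live board; in real use this loop
# would also refresh a live plot or instrument UI on each iteration.
for n, mean, var in running_stats(["1.0", "2.0", "garbage", "3.0"]):
    pass
```

In R the same pattern applies: read one line from the serial connection, update the model or plot, and loop, rather than collecting a full dataset first.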
1) A gallery of example R graphics:
These are contributed by users and vary quite a bit in quality. The output also looks much better when directed to a Cairo or PDF or Quartz (on OS X) device, as opposed to what is shown here.
2) Bioconductor, a subset of R packages for bioinformatics.
3) some good general R tips
4) the useR! conference is held every year with presentations and posters from a very wide variety of fields where R is applied in analysis. Here’s the program of presentations and posters (with PDFs) from 2007:
R is quite a nice package, but I was not able to make it read data from MySQL in a reasonable time (500,000 records: R >24 hours, Matlab ~2 min). It could be that you have to use some trick, but I did not find any solution.
That’s strange. I routinely fetch 30000+ records from a remote MySQL server in just a few (maybe 10-12) seconds from within R.
Well, probably I have to try once more ;-)
Do you fetch only numerical data, or some strings too?
Comments are closed.