Hadoop is a Java-based distributed processing and storage framework that's designed to run on thousands of commodity machines. You can think of it as an open source approximation of Google's MapReduce and GFS infrastructure. Yahoo!, in fact, runs many components of its search and ad products on Hadoop, so it's not too surprising that it's a major contributor to the project.

MapReduce is a method for writing software that can be parallelized across thousands of machines to process enormous amounts of data. For instance, let’s say you want to count the number of referrals, by domain, in all the world’s Apache server logs. Here’s the gist of how you’d do it:

  1. Get the whole world to upload their server logs to your gigantor distributed file system. You might automate and approximate this by having every web administrator add some JavaScript code to their site that causes their visitors' browsers to ping your own server, resulting in one giant log file of all the world's server logs. Your filesystem of choice is HDFS, the Hadoop Distributed File System, which handles partitioning and replicating this enormous file across all of your cluster nodes.
  2. Split the world’s largest log file into tiny pieces, and have your thousands of cluster machines parse the pieces, looking for referrers. This is the “Map” phase. Each chunk is processed and the referrers found in that chunk are output back to the system, which stores the output keyed by the referrer hostname. The chunk assignments are optimized so that the cluster nodes will process chunks of data that happen to be stored on their local fragment of the distributed file system.
  3. Finally, all the outputs from the Map phase are collated. This is called the "Reduce" phase. Each cluster node is assigned one of the hostname keys created during the Map phase. All of the outputs for that key are read in by the node and counted. The node then outputs a single result: the referrer's domain name and the total number of referrals produced from that referrer. This happens once for each referrer domain, potentially hundreds of thousands of times, distributed across the thousands of cluster nodes.

At the end of this hypothetical MapReduce job, you're left with a concise list of each domain that's referred traffic, and a count of how many referrals it's given. What's cool about Hadoop and MapReduce is that they make writing distributed applications like this surprisingly simple. The two functions to perform the example referrer parsing might only be about 20 lines of code. Hadoop takes care of the immense challenges of distributed storage and processing, letting you focus on your specific task.
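To give you a feel for how little application code is involved, here's a rough Python sketch of those two functions, plus a tiny stand-in for the framework's group-by-key step. The log format and the referrer-extraction regex are illustrative assumptions, not something Hadoop dictates:

```python
import re
from collections import defaultdict
from urllib.parse import urlparse

def map_chunk(lines):
    """Map phase: emit a (referrer_domain, 1) pair for each log line
    that contains a quoted http(s) referrer URL."""
    for line in lines:
        match = re.search(r'"(https?://[^"]+)"', line)
        if match:
            yield (urlparse(match.group(1)).netloc, 1)

def reduce_counts(key, values):
    """Reduce phase: sum all the counts emitted for one referrer domain."""
    return (key, sum(values))

def run_job(lines):
    """Stand-in for the framework: group map outputs by key,
    then hand each key's values to the reducer."""
    grouped = defaultdict(list)
    for key, value in map_chunk(lines):
        grouped[key].append(value)
    return dict(reduce_counts(k, v) for k, v in grouped.items())
```

In a real job, Hadoop runs `map_chunk` on thousands of chunks in parallel and does the grouping and shuffling for you; only the two small functions are yours to write.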

Since Hadoop is written in Java, the natural way to create distributed jobs is to encapsulate your Map and Reduce functions in a Java class. If you're not a Java junkie, though, don't worry: there's a job wrapper called HadoopStreaming which can communicate with any program that reads STDIN and writes STDOUT. This lets you write your distributed job in Perl, Python, or even a shell script! You create two programs, one for the mapper and one for the reducer, and HadoopStreaming handles uploading them to all of the cluster nodes and passing data to and from your programs.
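The streaming contract is just line-oriented text: the mapper emits tab-separated key/value lines on STDOUT, Hadoop sorts them by key, and the reducer then sees all of a key's lines contiguously on STDIN. Here's a minimal Python sketch of both halves for the referrer example (the regex and field layout are assumptions; in practice each function would live in its own script file):

```python
import re
import sys
from urllib.parse import urlparse

def mapper(stdin=sys.stdin, stdout=sys.stdout):
    """Emit "domain<TAB>1" for each quoted http(s) referrer on stdin."""
    for line in stdin:
        match = re.search(r'"(https?://[^"]+)"', line)
        if match:
            stdout.write(urlparse(match.group(1)).netloc + "\t1\n")

def reducer(stdin=sys.stdin, stdout=sys.stdout):
    """Sum counts per key. Input arrives sorted by key, so a key
    change means the previous key's group is complete."""
    current, total = None, 0
    for line in stdin:
        key, _, value = line.rstrip("\n").partition("\t")
        if key != current:
            if current is not None:
                stdout.write(f"{current}\t{total}\n")
            current, total = key, 0
        total += int(value)
    if current is not None:
        stdout.write(f"{current}\t{total}\n")
```

A nice property of this design is that you can test the whole pipeline locally without a cluster, e.g. `cat access.log | python mapper.py | sort | python reducer.py`, since `sort` plays the role of Hadoop's shuffle phase.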

If you want to play around with this, I really recommend a couple of howtos written by German hacker Michael G. Noll. He put together a walkthrough for getting Hadoop up and running on Ubuntu, and also a nice introduction to writing a MapReduce program using HadoopStreaming (with Python as an example).

Are any Hackszine readers using Hadoop? Let us know what you’re doing and point us to more information in the comments.

Running Hadoop On Ubuntu Linux
Writing An Hadoop MapReduce Program In Python