The Hadoop Distributed File System (HDFS) is an open source, distributed file system that’s designed to run on commodity hardware and provide a fault-tolerant data store capable of holding extremely large files spread across many machines. It’s similar in architecture to the Google File System, except that it’s something you can install and play with yourself.
Most people who use HDFS treat it as a storage layer for highly customized distributed applications, and the data is accessed programmatically via a Java API or through a few rudimentary shell commands. But what if you just want to use it as a general purpose file system that will automatically replicate many terabytes of data across a number of spare boxes hanging around the office? That’s where FUSE comes in.
These projects (enumerated below) allow HDFS to be mounted (on most flavors of Unix) as a standard file system using the mount command. Once mounted, you can operate on an HDFS instance with standard Unix utilities such as ‘ls’, ‘cd’, ‘cp’, ‘mkdir’, ‘find’, and ‘grep’, or use standard POSIX calls like open, write, read, and close from C, C++, Python, Ruby, Perl, Java, bash, etc.
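To make that concrete, here’s a minimal Python sketch of what “just ordinary file I/O” means once HDFS is mounted this way. The mount point path below is an assumption for illustration (substitute wherever your FUSE wrapper actually mounted HDFS); nothing in the code is HDFS-specific.

```python
import os

# Assumed mount point -- substitute the directory where your
# FUSE wrapper mounted HDFS (e.g. "/mnt/hdfs" on a real setup).
MOUNT = "/tmp/hdfs-demo"
os.makedirs(MOUNT, exist_ok=True)

path = os.path.join(MOUNT, "hello.txt")

# Plain POSIX-style open/write/read/close -- no HDFS API in sight.
with open(path, "w") as f:
    f.write("stored via the mounted file system\n")

with open(path) as f:
    print(f.read(), end="")
```

The point is that any tool or language runtime that can talk to the local file system can now talk to HDFS, without linking against Hadoop libraries.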
There are currently a few HDFS FUSE projects, some of which seem to be more actively maintained than others. One, called fuse-j-hdfs, is written with the FUSE for Java library and seems to be the most active project. Outside of FUSE, there’s also a WebDAV wrapper for HDFS that should provide mountable access from Windows clients.
Are there any HDFS gurus in the room who’d care to chime in with their own experiences with any of these tools?