Comfortably Geek

Lalith Suresh

© 2014. Lalith Suresh All rights reserved.

Towards a Scalable and Highly Available HDFS Namenode

After 3 months of intense hacking, I'm pleased to be writing about a little something I worked on for a project course here at KTH.

The premise

So we're all familiar with Hadoop, right? It's the little yellow elephant that provides an excellent platform for distributed computing, which is seeing rapid adoption by the industry, and involvement from major players like Yahoo!, Facebook and recently, Microsoft. Well, Hadoop and friends use the Hadoop Distributed File System (HDFS) as their underlying storage layer. Given the kind of jobs that are expected to run on top of it, HDFS is designed to store large files, and is optimised for throughput as opposed to latency.

HDFS is a single-master-server based distributed file system. Architecturally speaking, HDFS comprises of three important entities:

  1. Clients, who read/write files from/to the filesystem.
  2. Datanodes, which actually store the data (blocks) associated with the files.
  3. The Namenode, which is a central server that stores all the metadata associated with the files, and blocks.

This division between metadata storage and data storage is important, because typical use cases of HDFS are data intensive, and not metadata intensive. That's fine, but the problem is, if the Namenode crashes, the entire file system becomes inoperable because clients and Datanodes still need the metadata to do anything useful. Furthermore, since the Namenode maintains all the metadata only in memory, the number of files you can store on the filesystem is directly proportional to the amount of RAM the Namenode has. As if that's not enough, the Namenode will be completely saturated under write intensive workloads, and will be unable to respond to even simple client side queries like "ls". Have a look at Shvachko's paper which describes these problems at great length and depth, on which we've based our work.

Long story short, the needs of the hour are:

Our solution

In order to recover from crashes, the Namenode maintains a journal of all changes that it makes to the metadata. This pretty much involves logging every operation made to disk, and there is quite a huge piece of code related to this as well. However, the database community has been doing journaling, checkpointing and replicated storage since quite a while. So if you haven't guessed our solution yet, here it is:

"Move all of the Namenode's metadata storage into an in-memory, replicated, share-nothing distributed database."

In short, Namenodes themselves are reduced to a stateless frontend to the database, and fetch state into memory only when required. This comes with the added advantage of being able to have multiple stateless Namenodes for the same filesystem namespace. We chose MySQL Cluster as our database because of its wide spread use and stability. So for the filesystem to scale to a larger number of files, one needs to add more MySQL Cluster Datanodes, thus moving the bottleneck from the Namenode's RAM to the DB's storage capacity. For the filesystem to handle heavier workloads, one needs to add only more Namenode machines and divide the load amongst them. Another interesting aspect is that if a single Namenode machine has to reboot, it needn't fetch any state into memory and will be ready for action within a few seconds (although it still has to sync with Datanodes). Another advantage of our design is that the modifications will not affect the clients or Datanodes in anyway, except that we might need to find a way to divide the load among the Namenodes.

How we did it

We first dissected all the internal protocols being used in HDFS, i.e, the client-Namenode, Namenode-Datanode, and client-Datanode protocols. Next, we stripped out all the Namenode code that we didn't need. This was pretty much the code related to journaling, checkpointing, the secondary Namenode and so forth.

Next, we identified the key data structures we needed to move to the DB. We picked the two most memory intensive data-structures to migrate first: the Inodes, and the Blocks.

Since we were heavily time constrained (three months to deliver the project and the report), we decided to focus on functional correctness first, and then optimise later. So the easiest course of action seemed to be to modify the lowest levels of the call chain, replacing reads/writes from/to memory with query, insert, update and delete operations on the DB. We developed two helper classes, one each for Inodes and Blocks, and interfaced with the DB through these methods. We used the ClusterJ connector to talk to MySQL. This obviously meant that we needed a flat row representation for Inodes and Blocks in the DB, and we had some other problems to think of as well on the way. How do we index Inodes? How do we index Blocks? What about Triplets?

All in all, we tackled the problem of scaling the Namenode with a set of design decisions which we later found to be consistent with Shvacko's update paper on the Namenode's scalability, except that he suggests using HBase as the DB.

Current status

Current limitations

Obviously, our work isn't rainbows and sunshine; there's a long way to go. Here's what we don't have yet and are currently addressing:

Talk is cheap, show me the code!

The code is publicly available here for thy scrutiny. You'll also need to have a MySQL cluster setup in order to test this (we have a hard coded set of defaults in which you can politely ignore). :) Here's our presentation on it as well. We've dubbed our prototype implementation KTHFS (because we're students at KTH, but yes, no points for creativity on that one).

Future work

As an academic obligation, here's future work (and no I'm not going to write stuff we'll never do).

One member (not me) from the team will be continuing this work as part of his Masters thesis, and plans to address the above mentioned limitations as part of his work. I'll try to contribute as well during my free time (what are weekends for anyway?). Let's see how this goes. :)

comments powered by Disqus