• Towards a Scalable and Highly Available HDFS Namenode

    After 3 months of intense hacking, I'm pleased to be writing about a little something I worked on for a project course here at KTH.

    The premise

    So we're all familiar with Hadoop, right? It's the little yellow elephant that provides an excellent platform for distributed computing, which is seeing rapid adoption by the industry, and involvement from major players like Yahoo!, Facebook and recently, Microsoft. Well, Hadoop and friends use the Hadoop Distributed File System (HDFS) as their underlying storage layer. Given the kind of jobs that are expected to run on top of it, HDFS is designed to store large files, and is optimised for throughput as opposed to latency.

    HDFS is a single-master-server based distributed file system. Architecturally speaking, HDFS comprises three important entities:

    1. Clients, who read/write files from/to the filesystem.
    2. Datanodes, which actually store the data (blocks) associated with the files.
    3. The Namenode, which is a central server that stores all the metadata associated with the files, and blocks.

    This division between metadata storage and data storage is important, because typical use cases of HDFS are data intensive, not metadata intensive. The problem is that if the Namenode crashes, the entire file system becomes inoperable, because clients and Datanodes still need the metadata to do anything useful. Furthermore, since the Namenode maintains all the metadata only in memory, the number of files you can store on the filesystem is directly proportional to the amount of RAM the Namenode has. As if that's not enough, the Namenode gets completely saturated under write-intensive workloads, and is then unable to respond to even simple client-side queries like "ls". Have a look at Shvachko's paper, on which we've based our work; it describes these problems at great length and depth.

    Long story short, the needs of the hour are:

    • High availability for the Namenode, i.e., no single point of failure.
    • Horizontal scalability for the Namenode, i.e., to handle heavier loads, one would only need to add more Namenodes to the system rather than upgrading a single Namenode's hardware.

    Our solution

    In order to recover from crashes, the Namenode maintains a journal of all changes that it makes to the metadata. This pretty much involves logging every metadata operation to disk, and there is quite a large piece of code dedicated to this as well. However, the database community has been doing journaling, checkpointing and replicated storage for quite a while now. So if you haven't guessed our solution yet, here it is:

    "Move all of the Namenode's metadata storage into an in-memory, replicated, share-nothing distributed database."

    In short, Namenodes themselves are reduced to a stateless frontend to the database, and fetch state into memory only when required. This comes with the added advantage of being able to have multiple stateless Namenodes for the same filesystem namespace. We chose MySQL Cluster as our database because of its widespread use and stability. So for the filesystem to scale to a larger number of files, one needs to add more MySQL Cluster data nodes, thus moving the bottleneck from the Namenode's RAM to the DB's storage capacity. For the filesystem to handle heavier workloads, one only needs to add more Namenode machines and divide the load amongst them. Another interesting aspect is that if a single Namenode machine has to reboot, it needn't fetch any state into memory and will be ready for action within a few seconds (although it still has to sync with Datanodes). Another advantage of our design is that the modifications do not affect the clients or Datanodes in any way, except that we might need to find a way to divide the load among the Namenodes.
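
    To make the "stateless frontend" idea concrete, here's a minimal sketch of how a Namenode could obtain a session to MySQL Cluster via ClusterJ on startup. This is not our actual DBConnector.java; the class name, connect string and database name below are placeholders.

        import java.util.Properties;

        import com.mysql.clusterj.ClusterJHelper;
        import com.mysql.clusterj.Session;
        import com.mysql.clusterj.SessionFactory;

        public class NamenodeDb {
            // Built once per Namenode process; sessions are cheap, the factory isn't.
            private final SessionFactory factory;

            public NamenodeDb(String connectString, String database) {
                Properties props = new Properties();
                // Address of the MySQL Cluster management node (placeholder values).
                props.setProperty("com.mysql.clusterj.connectstring", connectString);
                props.setProperty("com.mysql.clusterj.database", database);
                factory = ClusterJHelper.getSessionFactory(props);
            }

            // One session per handler thread; all metadata reads/writes go through it.
            public Session newSession() {
                return factory.getSession();
            }
        }

    Since no filesystem state lives in the Namenode process itself, any number of such frontends can point at the same cluster.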

    How we did it

    We first dissected all the internal protocols used in HDFS, i.e., the client-Namenode, Namenode-Datanode, and client-Datanode protocols. Next, we stripped out all the Namenode code that we didn't need. This was pretty much the code related to journaling, checkpointing, the secondary Namenode and so forth.

    Next, we identified the key data structures we needed to move to the DB. We picked the two most memory intensive data-structures to migrate first: the Inodes, and the Blocks.

    Since we were heavily time constrained (three months to deliver the project and the report), we decided to focus on functional correctness first and optimise later. So the easiest course of action seemed to be to modify the lowest levels of the call chain, replacing reads/writes from/to memory with query, insert, update and delete operations on the DB. We developed two helper classes, one each for Inodes and Blocks, and interfaced with the DB through these methods. We used the ClusterJ connector to talk to MySQL Cluster. This obviously meant that we needed a flat row representation for Inodes and Blocks in the DB, and we had some other problems to think about along the way. How do we index Inodes? How do we index Blocks? What about Triplets?
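
    To give a flavour of what this looks like, here's a hedged sketch of a ClusterJ mapping for inodes, plus a helper in the spirit of what we built. The table layout, column names and fields are illustrative, not our actual schema.

        import com.mysql.clusterj.Session;
        import com.mysql.clusterj.annotation.Column;
        import com.mysql.clusterj.annotation.PersistenceCapable;
        import com.mysql.clusterj.annotation.PrimaryKey;

        // ClusterJ maps an annotated interface onto a table; hypothetical schema.
        @PersistenceCapable(table = "inodes")
        interface InodeRow {
            @PrimaryKey
            long getId();
            void setId(long id);

            @Column(name = "name")
            String getName();
            void setName(String name);

            @Column(name = "parent_id")
            long getParentId();
            void setParentId(long parentId);
        }

        // In-memory lookups become primary-key reads; mutations become writes.
        class InodeHelper {
            static InodeRow getInode(Session session, long id) {
                return session.find(InodeRow.class, id);  // PK lookup in the DB
            }

            static void putInode(Session session, InodeRow row) {
                session.savePersistent(row);              // insert-or-update
            }
        }

    A Block mapping would look much the same, keyed by block ID.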

    All in all, we tackled the problem of scaling the Namenode with a set of design decisions which we later found to be consistent with Shvachko's follow-up paper on the Namenode's scalability, except that he suggests using HBase as the DB.

    Current status

    • Multiple stateless Namenodes run merrily, storing Inode- and Block-related metadata in MySQL Cluster. As a validation test, clients can issue an "ls" query to any Namenode and see a consistent view of the filesystem, regardless of which Namenode updated the DB with the content.
    • We're trying to ensure functional correctness using the HDFS unit tests. We got the most important ones to pass, and deferred some bug fixing until later because we needed to evaluate the system as part of the course.
    • We've been evaluating the system using the Synthetic Load Generator. Horizontal scalability has been clearly achieved; adding more Namenodes improves the average number of operations per second for different workloads. With write-intensive workloads, the scalability is linear in terms of total operations/sec executed.

    Current limitations

    Obviously, our work isn't rainbows and sunshine; there's a long way to go. Here's what we don't have yet and are currently addressing:

    • Performance improvements. With a single load-generator thread throwing requests at the Namenode, we're within a tenth of the original Namenode's performance, because reads/writes from/to memory now go over a network to a database cluster (which is OK, I guess). But with more LoadGen threads, we're hitting a hefty bottleneck, which I'll describe in the next point.
    • The Namenode isn't fully stateless yet. The most important data structures we have yet to move are the DatanodeDescriptor entities and the file leases. There'll surely be more, but these are the most crucial ones. Once full statelessness is achieved, we can eliminate the read-write locks in the code, which are no longer needed in our implementation (the Namenode currently uses a multiple-reader-single-writer concurrency model). Profiling experiments indicated that the Namenode spends around 70% of its time waiting to acquire write locks. If we keep the Namenode fully stateless, we can wrap FSNamesystem operations into database transactions which can be batched, and let MySQL Cluster handle serialisability for us (it handles write-heavy transactions really well); see the sketch after this list. We can even break away from the single-writer model that the Namenode currently uses. Will this lead to a higher throughput for write operations than the original Namenode? Maybe. :)
    • Clients and Datanodes have to be statically partitioned as of now (it sufficed for our evaluations). We need a way for them to pick a random Namenode to perform their operations with.
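
    Here's the kind of transaction wrapping we have in mind. This is a hypothetical sketch reusing the InodeRow mapping from above; the operation shown (renaming an inode) is illustrative, not how FSNamesystem actually structures the call.

        import com.mysql.clusterj.Session;
        import com.mysql.clusterj.Transaction;

        class RenameSketch {
            // Instead of taking the FSNamesystem write lock, run the mutation as a
            // ClusterJ transaction and let MySQL Cluster serialise concurrent writers.
            static void renameInode(Session session, long inodeId, String newName) {
                Transaction tx = session.currentTransaction();
                tx.begin();
                try {
                    InodeRow row = session.find(InodeRow.class, inodeId);
                    row.setName(newName);
                    session.updatePersistent(row);
                    tx.commit();
                } catch (RuntimeException e) {
                    tx.rollback();  // caller may retry; no Namenode-wide lock held
                    throw e;
                }
            }
        }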

    Talk is cheap, show me the code!

    The code is publicly available here for thy scrutiny. You'll also need a MySQL Cluster setup in order to test this (we have a hard-coded set of defaults in DBConnector.java which you can politely ignore). :) Here's our presentation on it as well. We've dubbed our prototype implementation KTHFS (because we're students at KTH, but yes, no points for creativity on that one).

    Future work

    As an academic obligation, here's future work (and no, I'm not going to write stuff we'll never do).

    One member of the team (not me) will be continuing this work as part of his Masters thesis, and plans to address the above-mentioned limitations as part of it. I'll try to contribute as well during my free time (what are weekends for anyway?). Let's see how this goes. :)

  • Let's make metros more interesting

    After moving to Europe for my masters, I've grown accustomed to finding my way around cities using the subway trains. They're convenient, relatively cheap (unless you're in Stockholm, where using the word cheap throws an exception), and usually fast enough as well. The only problem I have with metros is that they're boring as shit to travel in, especially when you're alone. Sure, you can carry a book with you or invest in a tab/pad/slate/smartphone/whatever, but we definitely need to do something about the view from the windows.

    Given that there's hardly anything interesting about rocks and the occasional flash of light in a metro tunnel, I think there's a lot that can be done to make the metro experience better. Here's one idea I have to solve this international crisis which is oh-so-more-important than anything else on this planet. I call it, metro cartoons (the animated ones).

    The basic idea is to insert cartoon panels between the tunnel walls and the tracks. Since we know the approximate speed at which the trains move, we can estimate the panel spacing required to achieve a reasonable frame rate for the animation. I'm sure artists understand these aspects better than I do, and can come up with the right kind of drawings for this setting.
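
    To put rough numbers on it (my back-of-the-envelope figures, not measured): spacing = train speed / frame rate. A train doing 60 km/h covers about 16.7 metres per second, so a modest 12 frames per second would need a panel roughly every 1.4 metres. Faster sections would need the panels spaced further apart, or would simply play the animation quicker.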

    What say? I'm pretty sure someone must have tried this already, so let me know if you've heard of any such attempt.

  • Lagom

    Lagom (pronounced [ˈlɑ̀ːɡɔm]) is a Swedish word with no direct English equivalent, meaning "just the right amount". - Wikipedia

    If you're in Sweden and there's one word you should know, it would be 'lagom', a word that defines the essence of everything that is Swedish.

    More than two months into Stockholm and I must say it's been pretty lagom so far. The city itself is beautiful, with the only downside being that everything is at least twice as expensive as in Lisbon. Maybe that's what you get for everything around here being so perfect and systematic. And unlike Lisbon, you actually have easy access to a *large* collection of beers, which is definitely a plus.

    It's autumn here right now, and the scenery outside is fabulous. Sadly though, the weather in Sweden is not-so-lagom. Temperatures are already hovering around 5 degrees Celsius, and for an Indian like me, this is pretty cold in itself (heck, it never gets below 20 in my town back home). The funny thing is, this weather is cute for the Swedes. Let's see what winter ends up looking like.

    Meanwhile, here are some pictures I took in Stockholm over the last two months.

  • Fourth Annual Workshop on NS-3, 2012

    We're organising the fourth annual Workshop on NS-3, to be held in conjunction with SIMUTools 2012. The workshop will take place on March 23rd, 2012, in Sirmione, Italy. Details regarding the important dates, submission guidelines, and the scope of the workshop are available on the WNS3 2012 site.

    The workshop serves as an annual gathering of ns-3 users and developers to share ideas and brainstorm over future directions for the project. We're inviting conference-style full paper submissions, which will be made available in the ACM and EU digital proceedings. Furthermore, we're also hosting an interactive session comprising demos and posters. So start working on your papers and proposals as soon as possible! If you have any queries, feel free to contact the workshop chairs.

    Like previous editions of the workshop, we also plan to conduct a day-long ns-3 developers' meeting around the same time. More details will be out soon. Keep an eye on our mailing lists and website so you don't miss any announcements.

    Important dates

    Paper submission deadline : December 2, 2011

    Notification of acceptance : January 20, 2012

    Camera-ready deadline : February 10, 2012

    Demos and posters proposal deadline : February 24, 2012

    Workshop in Sirmione : March 23, 2012

    Developers’ meeting : TBA

  • Comic: Deciphering Academic Presentations

    YANC 4: Deciphering academic presentations