Comfortably Geek

Lalith Suresh

Archive for the ‘FOSS’ Category

Towards a Scalable and Highly Available HDFS Namenode

with 10 comments

After 3 months of intense hacking, I’m pleased to be writing about a little something I worked on for a project course here at KTH.

The premise

So we’re all familiar with Hadoop, right? It’s the little yellow elephant that provides an excellent platform for distributed computing, which is seeing rapid adoption by the industry, and involvement from major players like Yahoo!, Facebook and recently, Microsoft. Well, Hadoop and friends use the Hadoop Distributed File System (HDFS) as their underlying storage layer. Given the kind of jobs that are expected to run on top of it, HDFS is designed to store large files, and is optimised for throughput as opposed to latency.

HDFS is a single-master-server based distributed file system. Architecturally speaking, HDFS comprises of three important entities:

  1. Clients, who read/write files from/to the filesystem.
  2. Datanodes, which actually store the data (blocks) associated with the files.
  3. The Namenode, which is a central server that stores all the metadata associated with the files, and blocks.

This division between metadata storage and data storage is important, because typical use cases of HDFS are data intensive, and not metadata intensive. That’s fine, but the problem is, if the Namenode crashes, the entire file system becomes inoperable because clients and Datanodes still need the metadata to do anything useful. Furthermore, since the Namenode maintains all the metadata only in memory, the number of files you can store on the filesystem is directly proportional to the amount of RAM the Namenode has. As if that’s not enough, the Namenode will be completely saturated under write intensive workloads, and will be unable to respond to even simple client side queries like “ls”. Have a look at Shvachko’s paper which describes these problems at great length and depth, on which we’ve based our work.

Long story short, the needs of the hour are:

  • High availability for the Namenode, i.e, no single point of failure.
  • Horizontal scalability for the Namenode, i.e, to handle heavier loads, one would need to only add more Namenodes to the system than having to upgrade a single Namenode’s hardware.

Our solution

In order to recover from crashes, the Namenode maintains a journal of all changes that it makes to the metadata. This pretty much involves logging every operation made to disk, and there is quite a huge piece of code related to this as well. However, the database community has been doing journaling, checkpointing and replicated storage since quite a while. So if you haven’t guessed our solution yet, here it is:

“Move all of the Namenode’s metadata storage into an in-memory, replicated, share-nothing distributed database.”

In short, Namenodes themselves are reduced to a stateless frontend to the database, and fetch state into memory only when required. This comes with the added advantage of being able to have multiple stateless Namenodes for the same filesystem namespace. We chose MySQL Cluster as our database because of its wide spread use and stability. So for the filesystem to scale to a larger number of files, one needs to add more MySQL Cluster Datanodes, thus moving the bottleneck from the Namenode’s RAM to the DB’s storage capacity. For the filesystem to handle heavier workloads, one needs to add only more Namenode machines and divide the load amongst them. Another interesting aspect is that if a single Namenode machine has to reboot, it needn’t fetch any state into memory and will be ready for action within a few seconds (although it still has to sync with Datanodes). Another advantage of our design is that the modifications will not affect the clients or Datanodes in anyway, except that we might need to find a way to divide the load among the Namenodes.

How we did it

We first dissected all the internal protocols being used in HDFS, i.e, the client-Namenode, Namenode-Datanode, and client-Datanode protocols. Next, we stripped out all the Namenode code that we didn’t need. This was pretty much the code related to journaling, checkpointing, the secondary Namenode and so forth.

Next, we identified the key data structures we needed to move to the DB. We picked the two most memory intensive data-structures to migrate first: the Inodes, and the Blocks.

Since we were heavily time constrained (three months to deliver the project and the report), we decided to focus on functional correctness first, and then optimise later. So the easiest course of action seemed to be to modify the lowest levels of the call chain, replacing reads/writes from/to memory with query, insert, update and delete operations on the DB. We developed two helper classes, one each for Inodes and Blocks, and interfaced with the DB through these methods. We used the ClusterJ connector to talk to MySQL. This obviously meant that we needed a flat row representation for Inodes and Blocks in the DB, and we had some other problems to think of as well on the way. How do we index Inodes? How do we index Blocks? What about Triplets?

All in all, we tackled the problem of scaling the Namenode with a set of design decisions which we later found to be consistent with Shvacko’s update paper on the Namenode’s scalability, except that he suggests using HBase as the DB.

Current status

  • Multiple stateless Namenodes run merrily, which store Inode and Block related metadata in MySQL Cluster. As a validation test, Clients can do an “ls” query to any Namenode and see a consistent view of the filesystem regardless of which Namenode updated the DB with the content.
  • We’re trying to ensure functional correctness using the HDFS unit tests. We got the most important ones to pass, and decided to keep some more bug fixing until later because we needed to evaluate the system as part of the course.
  • We’ve been evaluating the system using the Synthetic Load Generator. Horizontal scalability has been clearly achieved; adding more Namenodes improves the average number of operations per second for different work loads. With write intensive work loads, the scalability is linear in terms of total operations/sec that are executed.

Current limitations

Obviously, our work isn’t rainbows and sunshine; there’s a long way to go. Here’s what we don’t have yet and are currently addressing:

  • Performance improvements. With a single load-generator thread throwing requests at the Namenode, we’re within a 10th of the original Namenode’s performance because read/writes from/to memory now go over a network to a database cluster (which is OK, I guess). But with more LoadGen threads, we’re experiencing a hefty bottleneck, which I’ll describe in the next point.
  • The Namenode isn’t fully stateless yet. The most important data structures we’re yet to move are the DatanodeDescriptor entities and the file leases. There’ll surely be more, but these are the most crucial ones. Once full statelessness is achieved, we can eliminate the read-write locks in the code which are absolutely not needed any more in our implementation (the Namenode currently uses a multiple-reader-single-writer concurrency model). Profiling experiments indicated that the Namenode spends around 70% of its time waiting to acquire write locks. If we keep the Namenode fully stateless, we can wrap FSNamesystem operations into Database transactions which can be batched, and let MySQL cluster handle serialisability for us (which can handle write-heavy transactions really well). We can even break away from the single-writer model that the Namenode currently uses. Will this lead to a higher throughput for write operations than the original Namenode? Maybe. :)
  • Clients and Datanodes have to be statically partitioned as of now (it sufficed for our evaluations). We need a way for them to pick a random Namenode to perform their operations with.

Talk is cheap, show me the code!

The code is publicly available here for thy scrutiny. You’ll also need to have a MySQL cluster setup in order to test this (we have a hard coded set of defaults in DBConnector.java which you can politely ignore). :)  Here’s our presentation on it as well. We’ve dubbed our prototype implementation KTHFS (because we’re students at KTH, but yes, no points for creativity on that one).

Future work

As an academic obligation, here’s future work (and no I’m not going to write stuff we’ll never do).

One member (not me) from the team will be continuing this work as part of his Masters thesis, and plans to address the above mentioned limitations as part of his work. I’ll try to contribute as well during my free time (what are weekends for anyway?). Let’s see how this goes. :)

Written by lalithsuresh

December 15, 2011 at 11:23 am

How to extend ns-3 for your research

with 8 comments

Having been busy with coursework lately, I hadn’t gone through our users’ list in a while. Wading through a week’s worth of posts today, it seems to me like a good deal of questions are from users who are trying to get started with extending ns-3. This is indeed quite expected; as a research tool, ns-3 is most useful only when built upon. These extensions usually take one of the following forms:

  1. a tweak to an existing protocol to make it simulate some specific scenario (try searching for “attacks” on our users list),
  2. adding some functionality X to an existing module, (for instance, RRC messages support for LTE)
  3. or writing an entirely new module from scratch.

Before you do *anything* with ns-3, go through the tutorials first.

Now, for cases 1 and 2 mentioned above, the *only* way to proceed is:

  1. Go through the literature about what you’re trying to implement — “What do I want to achieve?”
  2. Understand the scope and limitations of the ns-3 module you’re trying to deal with (go through the model documentation at least) — “Does ns-3 have the necessary base for me to build on top of?”
  3. If the answer to the above is “yes”, then start reading through the respective module’s code to figure out where you’ll need to insert your modifications. — “Where does my extension/tweak fit within the existing source code?”
  4. Implement.
  5. Profit.

Case 3, on the other hand, requires a lot more work:

  1. Go through the literature about what you’re trying to implement — “What do I want to achieve?”
  2. Understand how your module would fit within ns-3. This is usually the tricky part. To this end, it’s very important to understand how packets flow through a node within ns-3. This figure from our manual is usually the only thing you’ll need to know to get started.
  3. At this point, I’ll make things easier for myself and assume that you’re going to implement something that fits into the above mentioned architecture (rather than trying to modify the architecture itself). The first step is as simple as deriving from the right class. This gives you the virtual methods you need to implement in order to maintain a particular component’s semantics. So if you’re trying to write a new application, derive from ns3::Application. If it’s a new routing protocol, derive from ns3::Ipv4RoutingProtocol or ns3::Ipv6RoutingProtocol. If it’s a new NetDevice, derive from ns3::NetDevice. The easiest thing to do is to find another example of the component type you’re trying to develop and reflect its basic structure.
  4. Now to get started writing your new module, have a look at Gustavo Carnerio’s create-module.py script (inside src/)  which generates a skeleton for your new module. This includes the necessary sub-folders for the module, and also the all important wscript file. For most use cases, it would suffice to peek into some other module’s wscript file to get an idea of what to do. If you’re going to need some fancy external libraries, you’ll need to go through the waf documentation a bit. Look into src/click/wscript to get an idea of how to do external linking.
  5. Now once you start developing your new simulation model, you’ll need to attach this object to a node to get it to do something. This mode of attachment varies from component to component. Some objects are ‘aggregated’ to the node, some are added to a list of similar components (like applications being added to a node’s ‘ApplicationList’) and so forth. The best places to look in order to understand this are the helpers for each module. So for instance, if you want to figure out how to add your routing protocol to a node, look at src/olsr/helper/ to get an idea.
  6. Lastly, you’ll need to write simulation scripts to see your module in action. Copying off and editing existing example scripts from the examples/ folder or the src/*/examples/ folders should suffice for most cases.
  7. If you’re going to propose this new module for merge, look at our contributing code page. Keep in mind that we won’t merge code which doesn’t have any documentation, or tests (validation or unit tests, as is applicable).
  8. Merge. :)

Written by lalithsuresh

October 13, 2011 at 12:34 am

Posted in FOSS, NS-3

Tagged with , ,

HOWTO: Getting started with ns-3-click – Part II

leave a comment »

We now move into part II of the ns-3-click tutorial series wherein I’ll walk you through an ns-3 script and explain how to load your Click scripts onto ns-3 nodes. This tutorial goes through the following steps:

  1. Walkthrough of an example Click graph from ns-3-click.
  2. Using the Click graph in an ns-3 simulation.

 

Walkthrough of an example Click graph from ns-3-click

The Click script I’ll be explaining below is available in current ns-3-dev and stable releases from ns-3.11 onwards. You can find it in src/click/examples/nsclick-lan-single-interface.click.

When working with ns-3-click, you will need to handle all layer 3 functionalities expected of a networked node from Click. The Click script that we’ll be talking about here provides exactly this. It handles ARP, and forwards packets up and down the stack as required. It has a single network interface (eth0) to send and receive packets, and has a kernel interface (tap0) to send/receive packets from the kernel (in ns-3, this corresponds to communicating with layer 4).

We first describe the interface to tap0 with the following functionalities:

  1. All packets received from tap0 should be forwarded down to the stack.
  2. All packets received from below the stack, destined to us, should be sent up the stack via tap0.

elementclass TapSimHost {
$dev |

// Packets go to tap0, which sends them to the kernel
input[0]
-> ToDump(tokernel.pcap,2000,IP,PER_NODE 1)
-> ToSimDevice($dev,IP);

// Packets sent out by the kernel get pushed outside
FromSimDevice($dev,4096)
-> CheckIPHeader2
-> ToDump(fromkernel.pcap,2000,IP,PER_NODE 1)
-> GetIPAddress(16)
-> [0]output;
}

kernel::TapSimHost(tap0);

 

The above snippet does exactly what we’ve described so far. We create an element class which has a single input and output. Packets received on the input are plumbed to tap0 (because $dev is now tap0, as per the instantiation in the last line). Packets received from tap0 are pushed outside. We now describe a LAN host, which will handle ARP, and check the destination IP to see if we should receive the packet.

elementclass LanSimHost {
$ipaddr, $hwaddr |

cl::Classifier(12/0806 20/0001,12/0806 20/0002, -);
forhost::IPClassifier(dst host $ipaddr,-);
arpquerier::ARPQuerier(eth0);
arpresponder::ARPResponder(eth0);

ethout::Queue
-> ToDump(out_eth0.pcap,PER_NODE 1)
-> ToSimDevice(eth0);

// All packets received on eth0 are silently
// dropped if they are destined for another location
FromSimDevice(eth0,4096)
-> ToDump(in_eth0.pcap,PER_NODE 1,ENCAP ETHER)
-> cl;

// ARP queries from other nodes go to the ARP responder element
cl[0] -> arpresponder;

// ARP responses go to our ARP query element
cl[1] -> [1]arpquerier;

// All other packets get checked whether they are meant for us
cl[2]
-> Strip(14)
-> CheckIPHeader2
-> MarkIPHeader
-> GetIPAddress(16) // Sets destination IP address annotation
-> forhost;

// Packets for us are pushed outside
forhost[0]
-> [0]output;

// Packets for other folks or broadcast
// packets get sent to output 1
forhost[1]
-> ToDump(discard.pcap,2000,PER_NODE 1,ENCAP IP)
-> [1]output;

// Incoming packets get pushed into the ARP query module
input[0]
-> arpquerier;

// Both the ARP query and response modules send data out to
// the simulated network device, eth0.
arpquerier
-> ToDump(out_arpquery.pcap,PER_NODE 1)
-> ethout;

arpresponder
-> ToDump(out_arprespond.pcap,PER_NODE 1)
-> ethout;

}

lan::LanSimHost(eth0:ip,eth0:eth);

 

Now that we have a LanSimHost type ready, and instantiated (in the last line indicated above), we perform the final plumbing required to connect our LanSimHost to our kernel tap device:

// Users can do some processing between the two elements
lan[0] -> kernel;
kernel -> lan;
// Packets for others or broadcasts are discarded
lan[1] -> Discard;

This concludes our description of the nsclick-lan-single-interface.click file. Let’s now describe an ns-3 simulation script for our scenario.

 

Using the Click graph in an ns-3 script

I’ll now describe a simple ns-3 script which makes use of the above described Click graph. Not surprisingly, the script is named nsclick-simple-lan.cc and can be found within src/click/examples/. The simulation scenario is a simple one: two nodes A and B connected via a CSMA channel, with A sending B a stream of packets using a TCP connection. A is Click based, whereas B is a normal ns-3 node.

The first step would be to create the nodes.

NodeContainer csmaNodes;
csmaNodes.Create (2);

Next, we create a CSMA channel.

CsmaHelper csma;
csma.SetChannelAttribute ("DataRate", DataRateValue (DataRate (5000000)));
csma.SetChannelAttribute ("Delay", TimeValue (MilliSeconds (2)));
NetDeviceContainer csmaDevices = csma.Install (csmaNodes);

We then install a normal internet stack on node B.

InternetStackHelper internet;
internet.Install (csmaNodes.Get (1));

And then setup a Click based internet stack on node A. We need to specify the Click script that the particular node is supposed to use, and in the event that we require a Click based node to run a traffic generator on top, we need to specify a routing table element for the node to use. This can be seen by the name “rt” at the end of the nsclick-lan-single-interface.click file.

ClickInternetStackHelper clickinternet;
clickinternet.SetClickFile (csmaNodes.Get (0), "src/click/examples/nsclick-lan-single-interface.click");
clickinternet.SetRoutingTableElement (csmaNodes.Get (0), "rt");
clickinternet.Install (csmaNodes.Get (0));

Now that all the nodes have an internet stack, we assign IPv4 addresses to all the network interfaces.

Ipv4AddressHelper ipv4;
ipv4.SetBase ("172.16.1.0", "255.255.255.0");
ipv4.Assign (csmaDevices);

And then, we setup the traffic generators for talking between node A and node B.

Address LocalAddress (InetSocketAddress (Ipv4Address::GetAny (), 50000));
PacketSinkHelper packetSinkHelper ("ns3::TcpSocketFactory", LocalAddress);
ApplicationContainer recvapp = packetSinkHelper.Install (csmaNodes.Get (1));
recvapp.Start (Seconds (5.0));
recvapp.Stop (Seconds (10.0));
OnOffHelper onOffHelper ("ns3::TcpSocketFactory", Address ());
onOffHelper.SetAttribute ("OnTime", RandomVariableValue (ConstantVariable (1)));
onOffHelper.SetAttribute ("OffTime", RandomVariableValue (ConstantVariable (0)));

ApplicationContainer appcont;
addressValue remoteAddress (InetSocketAddress (Ipv4Address ("172.16.1.2"), 50000));
onOffHelper.SetAttribute ("Remote", remoteAddress);
appcont.Add (onOffHelper.Install (csmaNodes.Get (0)));
appcont.Start (Seconds (5.0));
appcont.Stop (Seconds (10.0));

Lastly, we enable PCAP tracing on all the CSMA NetDevices in the scenario.

csma.EnablePcap ("nsclick-simple-lan", csmaDevices, false);

And to conclude the script, we specify the running time for the simulation to be 20 seconds, and call Simulator::Run(). Don’t forget to call Simulator::Destroy() lest tools like Valgrind start screaming about memory leaks.

Simulator::Stop (Seconds (20.0));
Simulator::Run ();
Simulator::Destroy ();
return 0;

 

To see the results of running our simulation, execute the below in your terminal from the ns-3-dev top level directory once you’ve built Click as described in the previous article.


$: ./waf --run nsclick-simple-lan

Have a look at the resulting pcap traces (nsclick-simple-lan-0-[0,1].pcap) using wireshark or tcpdump to see what happened through the simulation.

Hope you found this little walkthrough helpful. If you find any bugs with ns-3-click, please don’t hesitate to file a bug report on our bugzilla. :)

Written by lalithsuresh

June 1, 2011 at 11:31 pm

Posted in FOSS, NS-3

Tagged with , ,

HOWTO: Getting started with ns-3-click – Part I

with 5 comments

Since its development, I’ve been seeing a lot of requests for an ns-3-click 101. So with no further ado, here’s the first in a series of tutorials to help you get an idea of how to go about using ns-3-click. In this article, I’ll provide an idea of what ns-3-click is and how to install it .

1. What is ns-3-click?

ns-3-click, or the NS-3 Click Integration, is a feature of the ns-3 tool which allows a user to use a Click Modular Router instance to handle an ns-3 node’s layer 3 functionality. Click is an architecture for designing highly flexible router configurations. Basically, it offers a large number of fine grained packet processing units called “elements”, which can be put connected in various combinations into a Click graph. A Click graph defines a particular configuration for a router. The motivation for bringing Click into ns-3 is simple. Click users get to test their Click graphs in a powerful simulation environment which ns-3 offers, and ns-3 benefits from the large feature set that Click brings with it in the form of elements. In ns-3-click, the design choice was to entirely delegate ns-3′s layer 3 functionality to Click. This means that an ns-3 node running a Click router will now have to use Click’s implementations of ARP, routing tables and so forth.

So before you get started with ns-3-click, I suggest going through [0] and [1]. The first paper should tell you everything you need to know about what Click is, and the second one will inform you about where Click fits into ns-3. The latter will help you understand what ns-3-click can or cannot do for you.

2. Installation

By now, you know what ns-3-click is, and its time to get your hands dirty. The first step is to download and build Click.

$: git clone git://read.cs.ucla.edu/git/click DIR
$: cd DIR
$: ./configure --enable-nsclick --enable-userlevel

Note: If you require additional modules, enable them as required using the –enable-<module> flag.

Now build Click:

$: make

Once this is complete, we need to build ns-3 and point it to the Click source we’ve just compiled. So let’s proceed to fetch and build ns-3. Note that ns-3-click was merged post ns-3.10, and will be released with ns-3.11. It is currently available in ns-3-dev.

$: hg clone http://code.nsnam.org/ns-3-dev
$: cd ns-3-dev
$: ./waf configure --with-nsclick=<path-to-click-source> --enable-examples

Once the last command has finished executing, you should see a list of features/modules that have been enabled. At this point, you should see the following line:

NS-3 Click Integration        : enabled

Now, let’s build ns-3:

$: ./waf build

If all goes well, ns-3 should now be built successfully. To test whether things went well, try running one of the example scripts.

$: ./waf --run nsclick-simple-lan

This should generate some PCAP traces (nsclick-simple-lan-*.pcap). If you see packets being exchanged in there, then you’re good to go!

This concludes part I of the ns-3-click tutorial. In the next part, I’ll provide a code walk through of a simple example script that uses ns-3-click. If you find any bugs with ns-3-click, please don’t hesitate to file a bug report on our bugzilla. :)

References

[0] Eddie Kohler, Robert Morris, Benjie Chen, John Jannotti, and M. Frans Kaashoek, “The Click Modular Router”. ACM Transactions on Computer Systems 18(3), August 2000, pages 263-297. (Paper from the MIT-PDOS page)

[1] Lalith Suresh P., Ruben Merz, ”NS-3-Click: Click Modular Router Integration for NS-3”. In Proc. of 3rd International ICST Workshop on NS-3 (WNS3), Barcelona, Spain. March, 2011. paper

Written by lalithsuresh

May 22, 2011 at 9:44 pm

Posted in FOSS, NS-3

Tagged with , ,

NS-3 Summer of Code 2011: Results Announced!

leave a comment »

The wait is over, and we hereby announce the students selected for the NS-3 Summer of Code 2011 programme!

  • Ashwin Narayan with “Click-MAC extensions for ns-3-click”. Mentored by Ruben Merz and myself.
  • Pankaj Gupta with “LTE-RRC extensions”. Mentored by Giuseppe Piro and Francesco Capozzi.
  • Atishay Jain with “IPv6 Global Routing”. Mentored by Tom Henderson and Mitch Watrous.

Congratulations to all the students. Hoping to see a lot of good code come out of this effort and most importantly, some long term contributors to the ns-3 project itself. :)

Written by lalithsuresh

April 29, 2011 at 8:33 pm

Posted in FOSS, GSoC, NS-3

Tagged with ,

Follow

Get every new post delivered to your Inbox.

Join 1,155 other followers