My broad area of research is distributed systems and networking. I seek to build systems that advance both the state of the art and the state of practice.

At VMware Research, I’m leading efforts to simplify cluster management for large-scale distributed systems. My most recent endeavor is the DCM project, which simplifies cluster manager development using declarative programming and code generation.

As a PhD student, I invented techniques to deliver predictable performance for certain classes of distributed systems. I’m grateful that some of that work has been impactful: C3 ships with ElasticSearch and has influenced the design of Spotify’s Expected Latency Selector (ELS).

I’m a fervent champion of open-source software. My time with the ns-3 network simulator project, where I was an active contributor and maintainer between 2009 and 2016, was a formative part of my career. My largest contribution to the project was ns-3’s integration with Click. I was excited to learn that ns-3 and its predecessors were awarded the 2020 ACM SIGCOMM Networking Systems Award.


Interns

I’m always on the lookout for motivated PhD students to work with. Reach out to me by email (suresh dot lalith at gmail) if you’re interested.

I’ve had the privilege of working with some fantastic interns at VMware Research:

  • Xudong Sun (UIUC)
  • Faria Kalim (UIUC)
  • João Loff (IST - Lisbon)
  • Muhammad Shahbaz (Princeton University)
  • Michael Tong (University of Chicago)


Selected projects

Most of my projects are open-source and available on my GitHub page.

  • Declarative Cluster Managers (DCM): Why write cluster management code by hand when you can generate the required implementation instead? The payoff: improved scalability, decision quality, and flexibility with an order of magnitude less code.
    [OSDI ‘20] [HotOS ‘19] [code]

  • Elmo: Scalable and flexible multicast at line-rate using source-routing. Check out Mellanox’s implementation of Elmo on their Spectrum-2 ASIC.
    [SIGCOMM ‘19] [P4 Summit]

  • Rapid: widely used cluster membership protocols go haywire in the presence of complex failure scenarios (e.g. high packet loss). Rapid instead guarantees stable and strongly consistent membership at scale. Check out its use to scale Akka Cluster to 10K nodes.
    [ATC ‘18] [code] [blog] [Community efforts: go-rapid, swift-rapid]

  • Wisp: decentralized, end-to-end rate limiting and request scheduling for micro-services.
    [SoCC ‘17]

  • C3: a replica selection algorithm for distributed data stores that is robust to performance variability among replicas. It currently ships with ElasticSearch and has influenced the design of Spotify’s Expected Latency Selector.
    [NSDI ‘15] [code]

  • Odin: a software-defined WiFi network, centered around a programmable virtual access point primitive. The project has seen many forks by researchers (a notable effort being the Wi-5 project).
    [ATC ‘14] [HotSDN ‘12] [code]


Selected Publications

A full list of my publications can be found on my Google Scholar page.