My broad area of research is distributed systems and networking. I seek to build systems that advance both the state of the art and the state of practice.

I am currently CEO & Co-founder at Feldera, where we are building an engine that brings real-time SQL powers to your lakehouse.

Prior to Feldera, I was a senior researcher at VMware Research (2016-2023), where I led research efforts to improve the scalability, reliability and extensibility of large-scale cluster managers. Sieve automatically tests Kubernetes controllers for reliability issues, and Anvil is a framework to build formally verified Kubernetes controllers. DCM makes it easy to build scalable and flexible cluster managers using declarative programming.

As a PhD student (2012-2016), I invented techniques to deliver predictable performance for certain classes of distributed systems. I’m grateful that some of that work has been impactful. C3 ships with Elasticsearch and OpenSearch, and influenced the design of Spotify’s ELS.

I’m a fervent champion of open-source software. My time with the ns-3 network simulator project was a formative part of my career, where I was an active contributor and maintainer between 2009 and 2016. My largest contribution to the project was ns-3’s integration with Click. I was excited to learn that ns-3 and its predecessors were awarded the 2020 ACM SIGCOMM Networking Systems Award.


Interns

I’m always on the lookout for motivated PhD students to work with. Reach out to me by email (suresh dot lalith at gmail) if you’re interested.

I’ve had the privilege of working with some fantastic interns at VMware Research:

  • Athinagoras Skiadopoulos (Stanford)
  • Xudong Sun (UIUC)
  • Faria Kalim (UIUC)
  • João Loff (IST - Lisbon)
  • Muhammad Shahbaz (Princeton University)
  • Michael Tong (University of Chicago)


Selected projects

Most of my projects are open-source and available on my GitHub page.

  • Anvil: Verifying liveness for cluster management controllers.
    [OSDI ‘24] [code]

  • Sieve: Automatically testing Kubernetes controllers for distributed systems-ey bugs.
    [OSDI ‘22] [HotOS ‘21] [KubeCon NA ‘21 talk] [code]

  • Declarative Cluster Managers (DCM): Why write cluster management code by hand when you can generate the required implementation instead? The payoff: improved scalability, decision quality, and flexibility with an order of magnitude less code.
    [VLDB ‘23] [OSDI ‘20] [HotOS ‘19] [code]

  • Elmo: Scalable and flexible multicast at line-rate using source-routing. Check out Mellanox’s implementation of Elmo on their Spectrum-2 ASIC.
    [SIGCOMM ‘19] [P4 Summit]

  • Rapid: widely used cluster membership protocols go haywire in the presence of complex failure scenarios (e.g. high packet loss). Rapid instead guarantees stable and strongly consistent membership at scale. Check out its use to scale Akka Cluster to 10K nodes.
    [ATC ‘18] [code] [blog] [Community efforts: go-rapid, swift-rapid]

  • Wisp: decentralized, end-to-end rate limiting and request scheduling for microservices.
    [SoCC ‘17]

  • C3: a replica selection algorithm for distributed data stores that is robust to performance variability among replicas. It currently ships with Elasticsearch and has influenced the design of Spotify’s Expected Latency Selector.
    [NSDI ‘15] [code]

  • Odin: a software-defined WiFi network, centered around a programmable virtual access point primitive. The project has seen many forks by researchers (a notable effort being the Wi-5 project).
    [ATC ‘14] [HotSDN ‘12] [code]


Selected publications

A full list of my publications can be seen on my Google Scholar page.


Recent professional service

Program committee:

  • SIGCOMM 22
  • OSDI 21
  • ATC 20
  • NSDI 20
  • SoCC 19
  • ATC 18
  • ICDCS 18
  • HotCloud 17

Artifact Evaluation Committee co-chair:

  • OSDI 2021