My broad area of research is distributed systems and networking. I seek to build systems that advance both the state of the art and the state of practice.

I am currently CEO & Co-founder at Feldera, where we are building a powerful incremental compute platform for AI, ML and data teams. It is powered by our award-winning research that allows us to incrementally execute arbitrarily complex SQL programs.

Prior to Feldera, I was a senior researcher at VMware Research (2016-2023), where I led research efforts to improve the scalability, reliability and extensibility of large-scale cluster managers. Anvil is a framework to build formally verified Kubernetes controllers, and Sieve automatically tests Kubernetes controllers for reliability issues. DCM makes it easy to build scalable and flexible cluster managers using declarative programming.

As a PhD student (2012-2016), I invented techniques to deliver predictable performance for certain classes of distributed systems. I’m grateful that some of that work has been impactful. C3 ships with Elasticsearch and OpenSearch as the Adaptive Replica Selection feature, and influenced the design of Spotify’s Expected Latency Selector (ELS).

I’m a fervent champion of open-source software. My time with the ns-3 network simulator project, where I was an active contributor and maintainer between 2009 and 2016, was a formative part of my career. My largest contribution to the project was ns-3’s integration with Click. I was excited to learn that ns-3 and its predecessors were awarded the 2020 ACM SIGCOMM Networking Systems Award.

Selected projects

Most of my projects are open-source and available on my GitHub page.

  • Anvil: Verifying liveness for cluster management controllers.
    [Best paper award at OSDI ‘24] [code]

  • Sieve: Automatically testing Kubernetes controllers for distributed systems-ey bugs.
    [OSDI ‘22] [HotOS ‘21] [KubeCon NA ‘21 talk] [code]

  • Declarative Cluster Managers (DCM): Combines incremental view maintenance, SQL and constraint programming to build scalable, flexible and powerful cluster managers (we built a high-performance Kubernetes Scheduler with it, among other things).
    [VLDB ‘23] [OSDI ‘20] [HotOS ‘19] [code]

  • Elmo: Scalable and flexible multicast at line-rate using source-routing. Check out Mellanox’s implementation of Elmo on their Spectrum-2 ASIC.
    [SIGCOMM ‘19] [P4 Summit]

  • Rapid: widely used cluster membership protocols go haywire in the presence of complex failure scenarios (e.g. high packet loss). Rapid instead guarantees stable and strongly consistent membership at scale. Check out its use to scale Akka Cluster to 10K nodes.
    [ATC ‘18] [code] [blog] [Community efforts: go-rapid, swift-rapid]

  • Wisp: decentralized, end-to-end rate limiting and request scheduling for micro-services.
    [SoCC ‘17]

  • C3: a replica selection algorithm for distributed data stores that is robust to performance variability among replicas. It currently ships with Elasticsearch and has influenced the design of Spotify’s Expected Latency Selector.
    [NSDI ‘15] [code]

  • Odin: a software-defined WiFi network, centered around a programmable virtual access point primitive. The project has seen many forks by researchers (a notable effort being the Wi-5 project).
    [ATC ‘14] [HotSDN ‘12] [code]


Selected Publications

A full list of my publications can be seen on my Google Scholar page.


Students

I’ve had the privilege of working with some fantastic PhD interns while at VMware Research:

  • Athinagoras Skiadopoulos (Stanford)
  • Xudong Sun (UIUC)
  • Faria Kalim (UIUC)
  • João Loff (IST - Lisbon)
  • Muhammad Shahbaz (Princeton University)
  • Michael Tong (University of Chicago)



Recent professional service

Program committee:

  • SIGCOMM 22
  • OSDI 21
  • ATC 20
  • NSDI 20
  • SoCC 19
  • ATC 18
  • ICDCS 18
  • HotCloud 17

Artifact Evaluation Committee co-chair:

  • OSDI 2021