Magic Numbers in Distributed Systems
I'm officially done with the first half of my masters as of now, and it's been a fun swim so far in the violent sea of distributed systems, which by the way, is a really big zoo where anything can go wrong. To quote Leslie Lamport:
A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable"
Thus, researchers over the last few decades have gone through great lengths to design distributed algorithms that ensure correct behaviour in the light of node failures and random activity by other processes. But if you're a student and you're going through these algorithms, it's not entirely obvious what some of the magic numbers mean at times, and what the intuition is behind them that ensures that an algorithm works correctly. This post is meant to help in that direction. As per distributed systems convention, N refers to the number of processes in the system and F refers to the number of failures that can be tolerated.
- \(N/2\)
- \((N+F)/2\)
- \( > (N+2F)/2\)
- \(F\)
- \(2F\)
- \(N-F\)
That's it for now.
Comments