Simple, reliable messaging. It takes a lot to support this statement. For 10 years WhatsApp demonstrated unprecedented reliability and availability, serving over 1.5B users. There is absolutely no way to reproduce interactions between all of them, within the cluster spanning over 10,000 nodes and multiple data centers. Investigations must be done on a live system without disturbing connected users. If there are repairs needed, it has to be done on the fly.
This article builds upon Vivek Rau’s chapter “Eliminating Toil” in Site Reliability Engineering: How Google Runs Production Systems . We begin by recapping Vivek’s definition of toil and Google’s approach to balancing operational work with engineering project work.  B. Beyer, C. Jones, J. Petoff, and N. Murphy, eds., Site Reli- ability Engineering (O’Reilly Media, 2016).