How they SRE is a curated collection of SRE resources.
The author suggests defining a single place where SREs should document temporary information on systems and their operation.
The author points out shortcomings of error-budget-based approaches in SRE.
Simple, reliable messaging. It takes a lot to support this statement. For 10 years WhatsApp demonstrated unprecedented reliability and availability, serving over 1.5B users. There is absolutely no way to reproduce interactions between all of them, within a cluster spanning over 10,000 nodes and multiple data centers. Investigations must be done on a live system without disturbing connected users. If repairs are needed, they have to be done on the fly.
This article builds upon Vivek Rau’s chapter “Eliminating Toil” in Site Reliability Engineering: How Google Runs Production Systems [1]. We begin by recapping Vivek’s definition of toil and Google’s approach to balancing operational work with engineering project work. [1] B. Beyer, C. Jones, J. Petoff, and N. Murphy, eds., Site Reliability Engineering (O’Reilly Media, 2016).
Critical but oft-neglected service metrics that every SRE and product owner should care about.