It is a truth universally acknowledged that systems do not run themselves. How, then, should a system—particularly a complex computing system that operates at a large scale—be run? Historically, companies have employed systems administrators to run complex computing systems. This systems administrator, or sysadmin, approach involves assembling existing software components and deploying them to work together to produce a service. Sysadmins are then tasked with running the service and responding to events and updates as they occur.
As the system grows in complexity and traffic volume, generating a corresponding increase in events and updates, the sysadmin team grows to absorb the additional work. The sysadmin model of service management has several advantages. For companies deciding how to run and staff a service, this approach is relatively easy to implement: as a familiar industry paradigm, there are many examples from which to learn and emulate.
A relevant talent pool is already widely available. The sysadmin approach also has disadvantages, which fall broadly into two categories: direct costs and indirect costs. Direct costs are neither subtle nor ambiguous.
The indirect costs arise from the fact that the two teams, development and operations, are quite different in background, skill set, and incentives. They use different vocabulary to describe situations; they carry different assumptions about both risk and possibilities for technical solutions; they have different assumptions about the target level of product stability. The split between the groups can easily become one of not just incentives, but also communication, goals, and eventually, trust and respect. This outcome is a pathology. Traditional operations teams and their counterparts in product development thus often end up in conflict, most visibly over how quickly software can be released to production.
At their core, the development teams want to launch new features and see them adopted by users.
And because their vocabulary and risk assumptions differ, both groups often resort to a familiar form of trench warfare to advance their interests. The ops team attempts to safeguard the running system against the risk of change by introducing launch and change gates. For example, launch reviews may contain an explicit check for every problem that has ever caused an outage in the past, which could be an arbitrarily long list, with not all elements providing equal value. The dev team quickly learns how to respond. They have fewer "launches" and more "flag flips," "incremental updates," or "cherrypicks."
Google has chosen to run our systems with a different approach: our Site Reliability Engineering teams focus on hiring software engineers to run our products and to create systems to accomplish the work that would otherwise be performed, often manually, by sysadmins. What exactly is Site Reliability Engineering, as it has come to be defined at Google? My explanation is simple: SRE is what happens when you ask a software engineer to design an operations team. When I joined Google and was tasked with running a "Production Team" of seven engineers, my entire life up to that point had been software engineering.
As a whole, SREs can be broken down into two main categories. By far, UNIX system internals and networking Layer 1 to Layer 3 expertise are the two most common types of alternate technical skills we seek. Common to all SREs is the belief in and aptitude for developing software systems to solve complex problems. Within SRE, we track the career progress of both groups closely, and have to date found no practical difference in performance between engineers from the two tracks. In fact, the somewhat diverse background of the SRE team frequently results in clever, high-quality systems that are clearly the product of the synthesis of several skill sets.
The result of our approach to hiring for SRE is that we end up with a team of people who (a) will quickly become bored by performing tasks by hand, and (b) have the skill set necessary to write software to replace their previously manual work, even when the solution is complicated. SREs also end up sharing academic and intellectual background with the rest of the development organization. Therefore, SRE is fundamentally doing work that has historically been done by an operations team, but using engineers with software expertise, and banking on the fact that these engineers are inherently both predisposed to, and have the ability to, design and implement automation with software to replace human labor.
By design, it is crucial that SRE teams are focused on engineering. Without constant engineering, operations load increases and teams will need more people just to keep pace with the workload. Eventually, a traditional ops-focused group scales linearly with service size: if the products supported by the service succeed, the operational load will grow with traffic. That means hiring more people to do the same tasks over and over again.
To avoid this fate, the team tasked with managing a service needs to code or it will drown. Google therefore places a cap on the aggregate operations work of SREs. This cap ensures that the SRE team has enough time in their schedule to make the service stable and operable. This cap is an upper bound; over time, left to their own devices, the SRE team should end up with very little operational load and almost entirely engage in development tasks, because the service basically runs and repairs itself: we want systems that are automatic, not just automated.
In practice, scale and new features keep SREs on their toes. So how do we enforce that threshold? In the first place, we have to measure how SRE time is spent. Often this means shifting some of the operations burden back to the development team, or adding staff to the team without assigning that team additional operational responsibilities.
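As a rough illustration of measuring how SRE time is spent, the sketch below computes the share of logged hours that went to operations work and compares it to a cap. The `TimeEntry` record, the function names, and the 0.5 cap used in the example are all hypothetical choices for illustration; the text above does not state the cap's value or describe any particular tooling.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class TimeEntry:
    """One logged block of work (hypothetical record format)."""
    engineer: str
    hours: float
    is_ops: bool  # True for operations work (tickets, on-call), False for project work


def ops_fraction(entries: Iterable[TimeEntry]) -> float:
    """Return the fraction of all logged hours spent on operations work."""
    entries = list(entries)
    total = sum(e.hours for e in entries)
    if total == 0:
        return 0.0  # nothing logged, so no measurable ops load
    ops = sum(e.hours for e in entries if e.is_ops)
    return ops / total


def exceeds_cap(entries: Iterable[TimeEntry], ops_cap: float) -> bool:
    """True when the team's aggregate ops share is above the given cap."""
    return ops_fraction(entries) > ops_cap


# Example: two engineers over one period; the cap of 0.5 is an assumed value.
log = [
    TimeEntry("alice", 30, is_ops=True),
    TimeEntry("alice", 10, is_ops=False),
    TimeEntry("bob", 15, is_ops=True),
    TimeEntry("bob", 25, is_ops=False),
]
if exceeds_cap(log, ops_cap=0.5):
    print("Ops load above cap: consider redirecting work or adding staff.")
```

The measurement is deliberately aggregate rather than per engineer, since the threshold discussed here applies to the team's overall operational load.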
Consciously maintaining this balance between ops and development work allows us to ensure that SREs have the bandwidth to engage in creative, autonomous engineering, while still retaining the wisdom gleaned from the operations side of running a service. Such teams are relatively inexpensive—supporting the same service with an ops-oriented team would require a significantly larger number of people.
Instead, the number of SREs needed to run, maintain, and improve a system scales sublinearly with the size of the system. Despite these net gains, the SRE model is characterized by its own distinct set of challenges. One continual challenge Google faces is hiring SREs: not only does SRE compete for the same candidates as the product development hiring pipeline, but the fact that we set the hiring bar so high in terms of both coding and system engineering skills means that our hiring pool is necessarily small.
As our discipline is relatively new and unique, not much industry information exists on how to build and manage an SRE team (although hopefully this book will make strides in that direction!). And once an SRE team is in place, their potentially unorthodox approaches to service management require strong management support. For example, the decision to stop releases for the remainder of the quarter once an error budget is depleted might not be embraced by a product development team unless mandated by their management.
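To make the release-freeze example concrete, here is a minimal sketch of an error budget check. It assumes one common formulation that this passage does not itself define: the budget is the share of requests allowed to fail under an availability target, that is, one minus the target. The function names and the figures in the example are hypothetical.

```python
def error_budget_remaining(slo_target: float,
                           total_requests: int,
                           failed_requests: int) -> float:
    """Return how many more failed requests the period can absorb.

    Assumes the budget is (1 - slo_target) * total_requests; a negative
    result means the budget is overspent.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    allowed_failures = (1.0 - slo_target) * total_requests
    return allowed_failures - failed_requests


def releases_allowed(slo_target: float,
                     total_requests: int,
                     failed_requests: int) -> bool:
    """True while some error budget is left for the period."""
    return error_budget_remaining(slo_target, total_requests, failed_requests) > 0


# Example with assumed figures: a 99.9% target over one million requests
# allows roughly 1,000 failures in the period.
print(releases_allowed(0.999, 1_000_000, 400))   # budget left
print(releases_allowed(0.999, 1_000_000, 1200))  # budget depleted
```

A check like this only reports the state of the budget; whether a depleted budget actually halts releases is the management decision described above.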
While the nuances of workflows, priorities, and day-to-day operations vary from SRE team to SRE team, all share a set of basic responsibilities for the service(s) they support, and adhere to the same core tenets. In general, an SRE team is responsible for the availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning of their service(s).
We have codified rules of engagement and principles for how SRE teams interact with their environment—not only the production environment, but also the product development teams, the testing teams, the users, and so on. Those rules and work practices help us to maintain our focus on engineering work, as opposed to operations work.
Their remaining time should be spent using their coding skills on project work. In practice, this is accomplished by monitoring the amount of operational work being done by SREs, and redirecting excess operational work to the product development teams: reassigning bugs and tickets to development managers, [re]integrating developers into on-call pager rotations, and so on.
When they are focused on operations work, on average, SREs should receive a maximum of two events per 8-hour on-call shift. This target volume gives the on-call engineer enough time to handle the event accurately and quickly, clean up and restore normal service, and then conduct a postmortem. Conversely, if on-call SREs consistently receive fewer than one event per shift, keeping them on point is a waste of their time.
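The two thresholds above can be checked against a record of events per shift. The sketch below is one simple way to do that, using the average for both bounds; reading "consistently" as an average is a simplifying assumption, and the function name and labels are hypothetical.

```python
from typing import Sequence


def classify_oncall_load(events_per_shift: Sequence[int],
                         max_events: float = 2.0,
                         min_events: float = 1.0) -> str:
    """Classify on-call load from the average number of events per shift.

    The defaults mirror the figures in the text: at most two events per
    shift on average, and at least one.
    """
    if not events_per_shift:
        raise ValueError("need at least one shift of data")
    average = sum(events_per_shift) / len(events_per_shift)
    if average > max_events:
        return "overloaded"       # too many events to handle each one properly
    if average < min_events:
        return "underutilized"    # on-call time is largely wasted
    return "within target"


# Example: event counts for three recent shifts (made-up numbers).
print(classify_oncall_load([3, 2, 4]))  # overloaded
print(classify_oncall_load([0, 1, 0]))  # underutilized
print(classify_oncall_load([1, 2, 2]))  # within target
```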
Postmortems should be written for all significant incidents, regardless of whether or not they paged; postmortems that did not trigger a page are even more valuable, as they likely point to clear monitoring gaps. This investigation should establish what happened in detail, find all root causes of the event, and assign actions to correct the problem or improve how it is addressed next time. Google operates under a blame-free postmortem culture, with the goal of exposing faults and applying engineering to fix these faults, rather than avoiding or minimizing them.
Product development and SRE teams can enjoy a productive working relationship by eliminating the structural conflict in their respective goals.