What Does Site Reliability Engineering Mean?
Site reliability engineering (SRE) is an approach to website operations that applies software engineering techniques to build more reliable websites. It was first developed at Google in 2003. The term is related to DevOps, which also mixes software engineering with system administration, but DevOps is concerned with automating operations in general rather than with reliability specifically.
Techopedia Explains Site Reliability Engineering
Site reliability engineering applies software engineering techniques, including algorithms, data structures, performance analysis and programming, to make web applications highly reliable.
In an interview, Google vice president of engineering Ben Treynor said that the company hired a 50-50 mix of people with backgrounds in software engineering and system administration for its SRE teams. Google assigns small SRE teams to major projects. Treynor attributed Google's remarkable uptime to the automation of many site operations activities. Failures still happen occasionally, but they are fixed quickly because the SRE team has automated so many tasks beforehand.
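As a rough illustration of what automating a site operations task can look like, the following minimal Python sketch restarts an unhealthy service a few times before escalating to a person. It is not Google's tooling: the `Service` class, its `restart_fixes` flag and the `remediate` function are hypothetical names invented for this example, and a real health check would probe a live system rather than read a flag.

```python
from dataclasses import dataclass


@dataclass
class Service:
    """Hypothetical stand-in for a real service (an assumption for this sketch)."""
    name: str
    healthy: bool = True
    restart_fixes: bool = True  # whether a restart clears the simulated fault
    restarts: int = 0

    def check_health(self) -> bool:
        # A real check would probe an endpoint or a process; here it reads a flag.
        return self.healthy

    def restart(self) -> None:
        # Count the restart and, if the fault is restartable, mark the service healthy.
        self.restarts += 1
        if self.restart_fixes:
            self.healthy = True


def remediate(service: Service, max_restarts: int = 3) -> str:
    """Try automated recovery first; ask for a human only if it fails.

    Returns "ok", "restarted" or "escalate".
    """
    if service.check_health():
        return "ok"
    for _ in range(max_restarts):
        service.restart()
        if service.check_health():
            return "restarted"
    # Automation could not fix the fault, so the on-call engineer is needed.
    return "escalate"


if __name__ == "__main__":
    print(remediate(Service("frontend", healthy=False)))
    print(remediate(Service("database", healthy=False, restart_fixes=False)))
```

The point of the sketch is the ordering: the routine fix runs automatically, and people are involved only when it does not work.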
Google has also taken inspiration from role-playing games in the way it structures operational readiness drills, which test how engineers handle failures that do require human intervention. The company calls these exercises "Wheel of Misfortune": one employee plays the role of the system and another plays the role of the on-call engineer. Treynor said this approach got engineers to think about reliability more than conventional drills did.
SRE is similar to DevOps, but the latter focuses on automating the deployment of systems generally, while SRE focuses specifically on reliability.