Are you looking for an interesting and competitive career that allows you to experience first-hand the full power of DevOps—and even go a few steps beyond? A site reliability engineer role might be a great fit.
What is site reliability engineering?
Site reliability engineering (SRE) was born at Google in 2003, before the DevOps movement took shape, when the first team of software engineers was tasked with making Google’s already large-scale sites more reliable, efficient, and scalable. The practices they developed served Google’s needs so well that other big tech companies, such as Amazon and Netflix, adopted them and contributed new practices of their own.
SRE eventually became a full-fledged IT domain, aimed at developing automated solutions for operational concerns such as on-call monitoring, performance and capacity planning, and disaster response. It complements other core DevOps practices, such as continuous delivery and infrastructure automation.
Google described its experience and findings in a book, “Site Reliability Engineering: How Google Runs Production Systems”, which is available online for free. The book introduces powerful concepts such as error budgets and service level objectives, and it describes Google’s practices around automation, handling emergencies and incidents, troubleshooting and monitoring, managing risk, and building scalable systems. It also covers organizational topics such as structuring the SRE team and on-call duties.
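To make the error budget idea concrete, here is a minimal sketch of the arithmetic, assuming an availability-style service level objective measured over a fixed time window. The function names, the 99.9% target, and the 30-day window are illustrative assumptions, not values or code taken from Google’s book.

```python
from datetime import timedelta


def error_budget(slo: float, window: timedelta) -> timedelta:
    """Return the downtime an availability SLO tolerates within the window.

    `slo` is a fraction such as 0.999 (hypothetical target, for illustration).
    """
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be a fraction in (0, 1]")
    # Whatever the SLO does not promise is the budget for unreliability.
    return window * (1.0 - slo)


def budget_remaining(slo: float, window: timedelta, downtime: timedelta) -> timedelta:
    """Return the unspent budget; a negative result means the SLO was missed."""
    return error_budget(slo, window) - downtime


if __name__ == "__main__":
    window = timedelta(days=30)
    print("Budget:", error_budget(0.999, window))
    print("Remaining:", budget_remaining(0.999, window, timedelta(minutes=30)))
```

With these assumed numbers, a 99.9% objective over 30 days leaves roughly 43 minutes of tolerated downtime. A team that has already used 30 of those minutes has little room left for risky releases until the window rolls over.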
What does a site reliability engineer do?
Ben Treynor, VP of engineering at Google and founder of Google SRE, pinpointed the essence of the SRE role in this interview:
“SRE is fundamentally doing work that has historically been done by an operations team, but using engineers with software expertise and banking on the fact that these engineers are inherently both predisposed to, and have the ability to, substitute automation for human labor. In general, an SRE team is responsible for availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning.”
Site reliability engineers create a bridge between development and operations by applying a software engineering mindset to system administration topics. They split their time between operations/on-call duties and developing systems and software that help increase site reliability and performance. Google insists that SREs spend no more than 50% of their time on operations and treats any violation of this rule as a sign of an unhealthy system.
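As a rough illustration of how a team might check itself against the 50% rule, here is a small sketch. The function names and the hours-based bookkeeping are assumptions made for this example; the article does not describe how Google actually measures operational load.

```python
def ops_fraction(ops_hours: float, total_hours: float) -> float:
    """Return the share of working time spent on operations and on-call work."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    if not 0 <= ops_hours <= total_hours:
        raise ValueError("ops_hours must be between 0 and total_hours")
    return ops_hours / total_hours


def exceeds_ops_cap(ops_hours: float, total_hours: float, cap: float = 0.5) -> bool:
    """Return True when operational work takes up more than the cap (50% by default)."""
    return ops_fraction(ops_hours, total_hours) > cap


if __name__ == "__main__":
    # Hypothetical week: 24 of 40 hours went to tickets, pages, and manual work.
    print(exceeds_ops_cap(24, 40))
```

A result of True for several weeks in a row would be the kind of signal the rule is meant to surface: too much manual work and too little time left for engineering it away.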
The ultimate goal for SREs is to, as Google puts it, “automate their way out of a job.” One important way to do this is to build self-service tools for user groups that rely on their services (e.g., automatic provisioning of test environments, logs, and statistics visualization). Doing so reduces work in progress for all parties, allows developers to focus exclusively on feature development, and frees SREs to move on to the next task to automate.
SREs collaborate closely with product developers to ensure that the designed solution meets non-functional requirements such as availability, performance, security, and maintainability. They also work with release engineers to ensure that the software delivery pipeline is as efficient as possible.
To gain better insight into what it means to be an SRE at Google, watch the testimonials of these five Google SREs.
Should you consider this career path?
You can become an SRE whether your background is in software or systems engineering, as long as you have solid foundations in both and a strong drive to improve and automate. If you are a systems engineer and want to improve your programming skills, or if you are a software engineer and want to learn how to manage large-scale systems, this role is for you. Deepening your knowledge in both areas will give you a competitive edge and more flexibility for the future.
If you are a “continuous improvement aficionado” like me, the SRE role will give you a system-wide view: You will understand how the software delivery value chain works and know how to ensure agility and reliability and deliver more value overall. It can be highly motivating and offer an ideal position to demonstrate the value you bring to your organization.
There is also no better role for staying in touch with the newest developments in the DevOps world and expanding your knowledge and skills in high-demand areas such as infrastructure automation, release engineering, and continuous delivery. It is highly improbable that you’ll get bored being an SRE. On the contrary, it’s a highly creative, stimulating, and technically challenging role.
Last but not least, since SREs are typically found at high-performing tech companies that have large data centers and complex technical challenges, their roles can be inspiring from both a financial and workplace culture perspective. Another plus: Google considers SREs scarce resources.