SRE Fundamentals
Understand Site Reliability Engineering — the discipline that applies software engineering principles to operations, pioneered by Google to manage planet-scale systems.
What is SRE?
Site Reliability Engineering (SRE) originated at Google in 2003, when Ben Treynor Sloss (later a VP of Engineering at Google) was asked to run a production operations team. He has described SRE as 'what happens when you ask a software engineer to design an operations team.' SRE applies software engineering principles to infrastructure and operations problems. Instead of accepting manual, repetitive work (toil), SREs automate it wherever the payoff justifies the effort. Instead of chasing 100% uptime (which is unattainable in practice and economically wasteful, since users' own networks and devices are less reliable than that), SREs define acceptable reliability targets (SLOs) and use error budgets to balance reliability with feature velocity. Key principle: SREs spend at most 50% of their time on ops work; the remaining time goes to engineering projects that reduce future toil. Google's SRE teams run services such as Search, Gmail, YouTube, and Cloud, which serve billions of users.
SRE vs DevOps
- DevOps is a culture/philosophy; SRE is a concrete implementation with specific practices and roles
- DevOps says 'break down silos'; SRE creates a specific team structure with defined responsibilities
- DevOps emphasizes CI/CD pipelines; SRE emphasizes SLOs, error budgets, and blameless postmortems
- DevOps focuses on deployment speed; SRE balances speed with reliability using error budgets
- SRE uses software engineering to solve ops problems — writing code to eliminate manual work (toil)
- They're complementary: SRE can be seen as a specific, opinionated implementation of DevOps principles
SLIs, SLOs, and Error Budgets
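- SLI (Service Level Indicator): A quantitative measure of service behavior, such as request latency (e.g. P99 latency in milliseconds), error rate (failed requests / total requests), or availability (successful responses / total responses). Thresholds such as 'P99 < 200ms' or 'error rate < 0.1%' are targets placed on an SLI, so they belong to the SLO rather than the SLI itself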
- SLO (Service Level Objective): A target value for an SLI — '99.9% of requests complete in under 200ms over a 30-day window'. NOT 100% — that's impossible and uneconomical
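- SLA (Service Level Agreement): A business contract with consequences (for example service credits or refunds) for missing the agreed service level. The SLA threshold is typically looser than the internal SLO (e.g. SLO: 99.95%, SLA: 99.9%), so the team sees trouble before a contractual breach occurs
- Error Budget: 100% minus the SLO. With a 99.9% SLO the error budget is 0.1%; over a 30-day window (43,200 minutes) that is about 43 minutes of allowed downtime. This is your 'budget' for risk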
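- Error Budget Policy: An agreed rule for what happens as the budget is spent. When the budget is exhausted, freeze feature deployments (reliability fixes excepted) and focus on reliability. When the budget is healthy, deploy aggressively. This naturally balances speed vs stability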
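- Burn Rate: How fast you're consuming your error budget relative to the SLO window. A burn rate of 1 spends the budget exactly by the end of the window. If you burn a full 30 days of budget in 1 hour, that is a burn rate of 720 and a critical incident requiring immediate response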
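The arithmetic behind error budgets and burn rate can be sketched in a few lines of Python. This is a minimal illustration, not a standard library or Google's implementation: the names Slo, availability_sli, burn_rate, and hours_to_exhaustion are hypothetical, the SLI is assumed to be a simple ratio of good requests to total requests, and burn rate is assumed to mean the observed error rate divided by the error budget.

```python
from dataclasses import dataclass

MINUTES_PER_DAY = 24 * 60


@dataclass(frozen=True)
class Slo:
    """Hypothetical SLO definition: a target fraction over a window."""
    target: float          # e.g. 0.999 for a 99.9% SLO
    window_days: int = 30  # length of the SLO window in days

    @property
    def error_budget(self) -> float:
        """Fraction of requests (or time) allowed to be bad: 1 - target."""
        return 1.0 - self.target

    def allowed_downtime_minutes(self) -> float:
        """Error budget expressed as minutes of full downtime per window."""
        return self.error_budget * self.window_days * MINUTES_PER_DAY


def availability_sli(good: int, total: int) -> float:
    """Availability SLI: successful responses / total responses."""
    if total == 0:
        return 1.0  # assumption: no traffic is treated as no failures
    return good / total


def burn_rate(slo: Slo, good: int, total: int) -> float:
    """Observed error rate divided by the error budget.

    1.0 means the budget lasts exactly one window; higher values
    mean the budget runs out sooner.
    """
    observed_error_rate = 1.0 - availability_sli(good, total)
    return observed_error_rate / slo.error_budget


def hours_to_exhaustion(slo: Slo, rate: float) -> float:
    """Hours until a full budget is spent at a constant burn rate."""
    if rate <= 0:
        return float("inf")
    return slo.window_days * 24 / rate


if __name__ == "__main__":
    slo = Slo(target=0.999)
    print(f"allowed downtime: {slo.allowed_downtime_minutes():.1f} min")
    # 50 failures out of 10,000 requests is a 0.5% error rate
    rate = burn_rate(slo, good=9_950, total=10_000)
    print(f"burn rate: {rate:.1f}")
    print(f"hours until budget is gone: {hours_to_exhaustion(slo, rate):.0f}")
```

With a 99.9% target and a 30-day window, allowed_downtime_minutes() comes out at roughly 43.2, and a constant burn rate of 720 empties the budget in one hour, matching the figures in the list above. Real alerting setups usually evaluate burn rate over several time windows at once; that refinement is left out here to keep the sketch short.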