Reliability as an engineering discipline, explained from first principles. Learn SLIs, SLOs and error budgets, incident response, toil, DORA metrics and resilience — and remember it with spaced repetition.
Site Reliability Engineering (SRE) is the practice of running services by treating operations as a software problem. Instead of chasing 100% uptime, SRE asks a sharper question: how reliable does this service actually need to be, and how should the remaining margin for failure be spent on shipping features? The answer is built from measurable targets rather than gut feeling.
The core loop is SLIs, SLOs and error budgets. An SLI is what you measure (the fraction of requests served well), an SLO is the target you hold it to, and the error budget is the allowance of unreliability that target leaves you — the number that decides when to keep shipping and when to freeze and stabilize. Around that sit the daily disciplines: structured incident response, cutting toil through automation, and measuring delivery with DORA's four keys.
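To make the arithmetic concrete, here is a minimal sketch of that loop in Python. The names (`Window`, `budget_remaining`, `decision`) and the simple "freeze when the budget is spent" rule are assumptions for illustration only; real budget policies are usually more nuanced and are agreed per team.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """Request counts over one SLO window (hypothetical shape)."""
    total_requests: int  # all valid requests in the window
    good_requests: int   # requests that were "served well"


def sli(w: Window) -> float:
    """SLI: the fraction of requests served well."""
    return w.good_requests / w.total_requests


def budget_remaining(w: Window, slo: float) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    allowed_bad = (1 - slo) * w.total_requests   # the error budget, in requests
    actual_bad = w.total_requests - w.good_requests
    return 1 - actual_bad / allowed_bad


def decision(w: Window, slo: float) -> str:
    """Illustrative policy only: keep shipping while budget is left."""
    return "ship" if budget_remaining(w, slo) > 0 else "freeze"


def allowed_downtime_minutes(slo: float, days: int = 30) -> float:
    """The same budget expressed as time for an availability SLO."""
    return (1 - slo) * days * 24 * 60


if __name__ == "__main__":
    healthy = Window(total_requests=1_000_000, good_requests=999_750)
    burned = Window(total_requests=1_000_000, good_requests=998_500)
    print(sli(healthy), budget_remaining(healthy, 0.999), decision(healthy, 0.999))
    print(sli(burned), budget_remaining(burned, 0.999), decision(burned, 0.999))
    print(allowed_downtime_minutes(0.999))
```

With a 99.9% SLO over one million requests, the budget is about 1,000 failed requests. The first window has spent roughly a quarter of it, while the second has overspent it by half. Expressed as time, the same target allows roughly 43 minutes of downtime in 30 days.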
This track breaks SRE into bite-sized, practical questions across reliability targets, incident management, toil and automation, DORA metrics and resilience practices like load shedding and chaos engineering — and uses spaced repetition so the concepts stick when you are the one on call.
Each module is a set of flashcards — 50 in total. Answer, review, and watch your knowledge grow from seed to full bloom.
SLI vs SLO vs SLA, error budgets, budget policy, and choosing the right reliability target
10 cards
Severity, on-call, incident command, escalation, and blameless postmortems
10 cards
What toil is, the 50% cap, why it is eliminated, and automating manual work away
10 cards
The four keys of software delivery performance, throughput vs stability, and performance tiers
10 cards
Capacity planning, load shedding, graceful degradation, chaos engineering, and cascading-failure defences
10 cards
A taste of the real flashcards. Pick an answer, then reveal the explanation.
What is a Service Level Indicator (SLI)?
What is the role of the Incident Commander (IC) during an incident?
In SRE, what is toil?
What do the DORA “four keys” metrics measure?
Each card is one practical concept with multiple options. Pick what you think is right.
See the correct option plus a clear explanation, and a link to deeper docs when one is available.
A spaced-repetition engine (SM-2 or FSRS) resurfaces each card just before you would forget it.
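For the curious, the scheduling idea can be sketched with the commonly cited form of the SM-2 update rule. This is an illustration, not this track's actual engine: the `Card` type and `review` function are hypothetical names, the card is assumed to be graded on a 0 to 5 quality scale, and SM-2 implementations differ on details such as whether the ease factor changes after a failed review.

```python
from dataclasses import dataclass


@dataclass
class Card:
    """Scheduling state for one flashcard (hypothetical shape)."""
    repetitions: int = 0    # consecutive successful reviews
    interval_days: int = 0  # days until the next review
    ease: float = 2.5       # SM-2 "E-Factor", never below 1.3


def review(card: Card, quality: int) -> Card:
    """Apply one review graded 0 (blackout) to 5 (perfect recall)."""
    if not 0 <= quality <= 5:
        raise ValueError("quality must be between 0 and 5")

    if quality >= 3:
        # Successful recall: the interval grows 1 day, 6 days, then by the ease factor.
        if card.repetitions == 0:
            interval = 1
        elif card.repetitions == 1:
            interval = 6
        else:
            interval = round(card.interval_days * card.ease)
        repetitions = card.repetitions + 1
    else:
        # Failed recall: start the card over and show it again tomorrow.
        repetitions = 0
        interval = 1

    # Ease update; this variant applies it on every review (an assumption).
    miss = 5 - quality
    ease = max(1.3, card.ease + (0.1 - miss * (0.08 + miss * 0.02)))
    return Card(repetitions, interval, ease)


if __name__ == "__main__":
    card = Card()
    for q in (4, 4, 4, 2):
        card = review(card, q)
        print(q, card)
```

Three good answers push the card out to 1, 6, and then 15 days. A failed answer resets it to the next day and lowers its ease factor, so it comes around more often afterwards.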
SLOs and error budgets turn "is it reliable enough?" from an argument into a number the whole team can act on.
Knowing the Incident Commander model and structured response means outages get coordinated instead of chaotic.
Spotting toil — repetitive, automatable work with no lasting value — is how you win back time to actually engineer.
SLIs/SLOs, error budgets and DORA metrics are staple topics in SRE, platform and senior DevOps interviews.
No. The track starts from what reliability means and builds up SLIs, SLOs and error budgets from first principles, so developers and ops engineers moving toward reliability work both benefit.
This track covers the SRE discipline itself — targets, incidents, toil, DORA and resilience — independent of any platform. The Kubernetes Ops: Reliability & Security track applies similar ideas to running workloads on Kubernetes specifically.
Yes, completely free. No registration or credit card is required, and all your progress is stored locally in your browser.
About 10 minutes a day. Spaced repetition means short, frequent sessions beat long cramming, so the concepts stick.
Plant your first seed today. Ten minutes a day is all it takes to grow real, lasting reliability skills.