Reliability · 5 modules

Site Reliability Engineering

Reliability as an engineering discipline, explained from first principles. Learn SLIs, SLOs and error budgets, incident response, toil, DORA metrics and resilience — and remember it with spaced repetition.

50 flashcards
~10 min per day
Level: Beginner → Intermediate
5 modules
About this topic

What is Site Reliability Engineering?

Site Reliability Engineering (SRE) is the practice of running services by treating operations as a software problem. Instead of chasing 100% uptime, SRE asks a sharper question: how reliable does this service actually need to be, and how should the unreliability it can tolerate be spent on shipping features? The answer is built from measurable targets rather than gut feeling.

The core loop is SLIs, SLOs and error budgets. An SLI is what you measure (the fraction of requests served well), an SLO is the target you hold it to, and the error budget is the allowance of unreliability that target leaves you — the number that decides when to keep shipping and when to freeze and stabilize. Around that sit the daily disciplines: structured incident response, cutting toil through automation, and measuring delivery with DORA's four keys.
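The arithmetic behind that loop is small enough to sketch. The example below is a minimal illustration, not a production tool: the Window class, the function names and the request counts are all hypothetical, and it assumes a request-based SLI with an SLO strictly below 100%.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """Request counts for one SLO window (hypothetical example data)."""
    total_requests: int  # all valid requests in the window
    good_requests: int   # requests that met the "served well" criterion


def sli(w: Window) -> float:
    """The SLI: fraction of requests served well."""
    return w.good_requests / w.total_requests


def budget_remaining(w: Window, slo: float) -> float:
    """Fraction of the error budget still unspent (negative when overspent).

    Assumes slo < 1.0; a 100% target leaves no budget to divide by.
    """
    allowed_bad = (1 - slo) * w.total_requests        # the error budget
    actual_bad = w.total_requests - w.good_requests   # budget already spent
    return 1 - actual_bad / allowed_bad


window = Window(total_requests=1_000_000, good_requests=999_400)
print(f"SLI: {sli(window):.2%}")
print(f"Error budget remaining: {budget_remaining(window, slo=0.999):.0%}")
```

With a 99.9% SLO, one million requests leave a budget of about 1,000 failures. Here 600 have been spent, so roughly 40% of the budget remains. A budget policy would act on that number, for example by freezing releases once it reaches zero.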

This track breaks SRE into bite-sized, practical questions across reliability targets, incident management, toil and automation, DORA metrics and resilience practices like load shedding and chaos engineering — and uses spaced repetition so the concepts stick when you are the one on call.

What you'll learn

5 modules, seed to bloom

Each module is a set of flashcards — 50 in total. Answer, review, and watch your knowledge grow from seed to full bloom.

Reliability Targets

SLI vs SLO vs SLA, error budgets, budget policy, and choosing the right reliability target

10 cards

Incident Management

Severity, on-call, incident command, escalation, and blameless postmortems

10 cards

Toil & Automation

What toil is, the 50% cap on toil, why toil is worth eliminating, and automating manual work away

10 cards

DORA Metrics

The four keys of software delivery performance, throughput vs stability, and performance tiers

10 cards

Resilience Practices

Capacity planning, load shedding, graceful degradation, chaos engineering, and cascading-failure defences

10 cards
Try before you plant

Sample questions

A taste of the real flashcards. Pick an answer, then reveal the explanation.

Sample · Site Reliability Engineering

What is a Service Level Indicator (SLI)?

  • A. A quantitative measure of a service level, such as the proportion of requests served successfully or fast enough
  • B. A target value a service should meet, such as 99.9% of requests succeeding over a rolling 28-day window
  • C. A contract with customers that sets financial penalties, such as refunds when uptime falls below a threshold
  • D. A remaining allowance of unreliability, such as the number of failed requests a team may spend before freezing
Sample · Site Reliability Engineering

What is the role of the Incident Commander (IC) during an incident?

  • A. To hold overall coordination and decision-making authority, delegating tasks and keeping the response organised
  • B. To personally debug and fix the failing system while everyone else waits quietly for the outage to end
  • C. To write the customer-facing status page updates alone while no one coordinates the technical response
  • D. To approve the budget and hardware purchases needed before any mitigation work is allowed to start
Sample · Site Reliability Engineering

In SRE, what is toil?

  • A. Manual, repetitive, automatable operational work that has no enduring value and scales with service size
  • B. Any difficult engineering project that requires deep design thinking and permanently improves the service
  • C. The paperwork of meetings, planning, and email that surrounds a team but never touches production systems
  • D. One-off creative debugging of a novel outage that the team has never encountered or documented before
Sample · Site Reliability Engineering

What do the DORA “four keys” metrics measure?

  • A. Software delivery and operational performance: how quickly and reliably a team ships changes to production
  • B. The raw code quality of a repository: how many style violations and bugs a static analyzer can detect
  • C. Individual developer productivity: how many lines of code and commits each engineer produces per sprint
  • D. Infrastructure spend efficiency: how much a team pays per server and how fully its capacity is utilised
How Gnoseed works

Learn it once, keep it for good

1

Answer a question

Each card is one practical concept with multiple options. Pick what you think is right.

2

Get the full answer

See the correct option plus a clear explanation, and a link to deeper docs when one is available.

3

Review at the right time

A spaced-repetition engine (SM-2 or FSRS) resurfaces each card just before you would forget it.
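
For the curious, here is a rough sketch of how an SM-2 style scheduler picks the next review date. It follows the commonly published description of SM-2 (intervals of 1 and 6 days, then the previous interval multiplied by an ease factor that never drops below 1.3). The CardState and sm2_review names are hypothetical, and nothing here is claimed to match how Gnoseed's own engine is implemented.

```python
from dataclasses import dataclass


@dataclass
class CardState:
    """Scheduling state for one card (hypothetical names)."""
    repetitions: int = 0    # consecutive successful reviews
    interval_days: int = 0  # days until the next review
    ease: float = 2.5       # SM-2 easiness factor, starts at 2.5


def sm2_review(state: CardState, quality: int) -> CardState:
    """Apply one review; quality runs from 0 (blackout) to 5 (perfect recall)."""
    if not 0 <= quality <= 5:
        raise ValueError("quality must be between 0 and 5")

    if quality < 3:
        # Failed recall: restart the schedule, leave the ease factor unchanged.
        return CardState(repetitions=0, interval_days=1, ease=state.ease)

    # Successful recall: 1 day, then 6 days, then grow by the current ease.
    if state.repetitions == 0:
        interval = 1
    elif state.repetitions == 1:
        interval = 6
    else:
        interval = round(state.interval_days * state.ease)

    # Ease rises after easy answers, falls after hard ones, floored at 1.3.
    miss = 5 - quality
    ease = max(1.3, state.ease + 0.1 - miss * (0.08 + miss * 0.02))
    return CardState(state.repetitions + 1, interval, ease)


state = CardState()
for _ in range(3):
    state = sm2_review(state, quality=5)
    print(state.interval_days)  # prints 1, then 6, then 16
```

Three perfect answers in a row push the card out to 1, 6 and then 16 days. A failed answer brings it back to a 1-day interval, which is why weak cards resurface often and strong ones fade into the background.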

Why learn this

Why SRE is worth your time

A framework for reliability decisions

SLOs and error budgets turn "is it reliable enough?" from an argument into a number the whole team can act on.

Calmer incidents

Knowing the Incident Commander model and structured response means outages get coordinated instead of chaotic.

Automate the right things

Spotting toil — repetitive, automatable work with no lasting value — is how you win back time to actually engineer.

Interview-ready

SLIs/SLOs, error budgets and DORA metrics are staple topics in SRE, platform and senior DevOps interviews.

FAQ

Common questions

Do I need to be an SRE already?

No. The track starts from what reliability means and builds up SLIs, SLOs and error budgets from first principles, so developers and ops engineers moving toward reliability work both benefit.

How is this different from the Kubernetes reliability track?

This track covers the SRE discipline itself — targets, incidents, toil, DORA and resilience — independent of any platform. The Kubernetes Ops: Reliability & Security track applies similar ideas to running workloads on Kubernetes specifically.

Is it free?

Yes, completely free. No registration or credit card is required, and all your progress is stored locally in your browser.

How long does it take?

About 10 minutes a day. Spaced repetition means short, frequent sessions beat long cramming, so the concepts stick.

Ready to master SRE?

Plant your first seed today. Ten minutes a day is all it takes to grow real, lasting reliability skills.

Start learning free