Getting a model to answer is easy; serving it reliably and affordably is the job. Learn inference, optimization, serving frameworks, cost and latency, and MLOps, and retain it all with spaced repetition.
Calling a hosted API is one thing; running a model in production is another. Serving means standing up an inference endpoint that takes requests, runs prefill and decode, streams tokens back, and stays up under real traffic — whether you rent it from a provider or self-host on your own GPUs.
The hard parts are performance and economics. Optimization techniques — the KV cache, batching, paged attention, quantization and speculative decoding — decide your throughput and latency. Serving frameworks handle OpenAI-compatible APIs, queuing, replicas and load balancing for you. And cost and latency — TTFT, tokens per second, GPU sizing and token billing — decide whether the feature is sustainable at scale.
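As a rough illustration of how the cost and latency numbers above combine, here is a minimal sketch. It assumes end-to-end latency is approximately TTFT plus output length divided by decode speed, and that billing is per million input and output tokens. The function names, prices and traffic figures are hypothetical and not taken from any provider.

```python
def request_latency_s(ttft_s: float, output_tokens: int, tokens_per_s: float) -> float:
    """Approximate end-to-end latency: wait for the first token, then stream the rest."""
    return ttft_s + output_tokens / tokens_per_s


def request_cost_usd(
    input_tokens: int,
    output_tokens: int,
    usd_per_m_input: float,
    usd_per_m_output: float,
) -> float:
    """Token billing: input and output tokens are priced separately, per million tokens."""
    return (input_tokens * usd_per_m_input + output_tokens * usd_per_m_output) / 1_000_000


# Hypothetical request: 2,000 prompt tokens, 300 generated tokens.
latency = request_latency_s(ttft_s=0.5, output_tokens=300, tokens_per_s=50.0)
cost = request_cost_usd(2000, 300, usd_per_m_input=1.0, usd_per_m_output=4.0)
print(f"latency ~ {latency:.1f} s, cost ~ ${cost:.4f}")
```

Multiplying the per-request cost by expected daily traffic gives a first estimate of whether a feature is sustainable before any optimization work starts.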
This track grounds production serving in bite-sized, practical questions and uses spaced repetition, so the levers — and the MLOps discipline of versioning, rollouts, drift and monitoring — stay sharp when you are the one on call.
Each module is a set of practice cards — 75 in total. Answer, review, and watch your knowledge grow from seed to full bloom.
Inference basics (15 cards): what serving a model means, hosted vs self-hosted, prefill and decode, streaming, and statelessness.
Optimization (15 cards): KV cache, quantization, batching, paged attention, speculative decoding, and latency versus throughput.
Serving frameworks (15 cards): what they do, OpenAI-compatible APIs, tokenizers, parallelism, queuing, replicas, and load balancing.
Cost and latency (15 cards): TTFT, tokens per second, GPU versus CPU, memory sizing, token billing, utilization, and autoscaling.
MLOps and monitoring (15 cards): versioning, rollouts, drift, evaluation in production, guardrails, observability, and logging.
A taste of the real cards. Pick an answer, then reveal the explanation.
What does it mean to "serve" a model?
What is the KV cache during LLM generation?
What does time to first token (TTFT) measure?
What is model versioning in production?
Each card is one practical concept with multiple options. Pick what you think is right.
See the correct option plus a clear explanation, and a link to deeper docs when one is available.
A spaced-repetition engine (SM-2 or FSRS) resurfaces each card just before you would forget it.
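For a sense of how such an engine schedules reviews, here is a simplified sketch of an SM-2 style update. It follows the commonly published form of the algorithm (answer quality from 0 to 5, first intervals of 1 and 6 days, ease factor floored at 1.3). The names CardState and review are hypothetical, and this is not necessarily how this track's own engine is implemented.

```python
from dataclasses import dataclass


@dataclass
class CardState:
    repetitions: int = 0    # consecutive successful reviews
    interval_days: int = 0  # days until the next review
    ease: float = 2.5       # ease factor; higher means intervals grow faster


def review(state: CardState, quality: int) -> CardState:
    """Return the next state after a review graded 0 (blackout) to 5 (perfect)."""
    if not 0 <= quality <= 5:
        raise ValueError("quality must be between 0 and 5")
    if quality >= 3:
        # Successful recall: 1 day, then 6 days, then multiply by the ease factor.
        if state.repetitions == 0:
            interval = 1
        elif state.repetitions == 1:
            interval = 6
        else:
            interval = round(state.interval_days * state.ease)
        repetitions = state.repetitions + 1
    else:
        # Failed recall: start the card over.
        repetitions = 0
        interval = 1
    # Ease drifts up on easy answers and down on hard ones, never below 1.3.
    ease = state.ease + (0.1 - (5 - quality) * (0.08 + (5 - quality) * 0.02))
    return CardState(repetitions, interval, max(1.3, ease))


state = CardState()
intervals = []
for q in (5, 5, 5, 2):
    state = review(state, q)
    intervals.append(state.interval_days)
print(intervals)  # [1, 6, 16, 1]
```

Three confident answers push the card out to 16 days; one miss brings it back the next day, which is why short daily sessions are enough.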
TTFT, tokens per second, GPU memory and utilization decide cost — understanding them keeps an LLM feature affordable at scale.
The KV cache, batching and paged attention are where latency and throughput are won or lost. Know the levers, not just the buzzwords.
Serving frameworks hide real complexity — queuing, replicas, load balancing. Knowing what they do lets you tune instead of guess.
Versioning, rollouts, drift and monitoring are what keep a deployed model trustworthy long after launch, not just on launch day.
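To make the GPU memory lever mentioned above concrete, here is a back-of-the-envelope sketch of KV cache sizing. It assumes a standard transformer decoder that stores one key and one value vector per layer, per KV head, per token. The model dimensions are hypothetical and do not describe any specific model.

```python
def kv_cache_bytes(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    seq_len: int,
    batch_size: int = 1,
    bytes_per_value: int = 2,  # 2 bytes for fp16/bf16
) -> int:
    """Memory for cached keys and values; the leading 2 counts K and V."""
    return (
        2 * num_layers * num_kv_heads * head_dim
        * seq_len * batch_size * bytes_per_value
    )


# Hypothetical model: 32 layers, 8 KV heads of dimension 128, 4,096-token context.
per_sequence = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=4096)
print(f"{per_sequence / 2**20:.0f} MiB per sequence")  # 512 MiB per sequence
```

Because the cache grows linearly with both context length and batch size, it often limits how many concurrent requests fit on one GPU alongside the model weights.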
No, but some backend or infrastructure background helps — this track is aimed at people running models in production. The concepts (latency, batching, cost) are explained plainly.
Yes. The Inference Basics module covers hosted vs self-hosted serving, and the rest applies to both — optimization, cost and MLOps matter wherever the model runs.
Yes, completely free. No registration or credit card is required, and all your progress is stored locally in your browser.
About 10 minutes a day. Spaced repetition means short, frequent sessions beat long cramming, so the levers and trade-offs stick.
Plant your first seed today. Ten minutes a day is all it takes to run models that stay fast, affordable and reliable under real load.