What is the KV cache during LLM generation?

Question

Accepted Answer

Stored key and value tensors of past tokens, reused across decode steps. The KV cache holds the attention keys and values for already-seen tokens, so each decode step computes keys and values only for the new token and reuses the rest. This speeds up generation, and it is distinct from caching whole responses.

Answer

A copy of the model's weights kept in CPU RAM as a backup for the GPU

Answer

A pool of finished responses returned directly when prompts repeat

Answer

A log of every request and reply kept on disk for later auditing

Why this is the answer
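
To make the reuse concrete, here is a minimal sketch of a single attention head decoding one token at a time. It is an illustration under stated assumptions, not how any particular serving stack implements it: the names `KVCache`, `decode_step`, `recompute_last`, and the tiny 2x2 weight matrices are all hypothetical, and real systems store batched tensors per layer and per head on the accelerator.

```python
import math


def project(matrix, vector):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def attend(query, keys, values):
    """Scaled dot-product attention for one query over all given keys/values."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total for i in range(dim)]


class KVCache:
    """Hypothetical cache: one key and one value vector per token seen so far."""

    def __init__(self):
        self.keys = []
        self.values = []

    def append(self, key, value):
        self.keys.append(key)
        self.values.append(value)


def decode_step(x, w_q, w_k, w_v, cache):
    """Handle one new token: compute only its key/value, reuse the cached rest."""
    cache.append(project(w_k, x), project(w_v, x))
    return attend(project(w_q, x), cache.keys, cache.values)


def recompute_last(xs, w_q, w_k, w_v):
    """Baseline without a cache: re-project every token, then attend from the last."""
    keys = [project(w_k, x) for x in xs]
    values = [project(w_v, x) for x in xs]
    return attend(project(w_q, xs[-1]), keys, values)


# Toy weights and token embeddings, chosen arbitrarily for illustration.
W_Q = [[1.0, 0.0], [0.0, 1.0]]
W_K = [[0.5, 0.1], [0.2, 0.7]]
W_V = [[1.0, 2.0], [0.0, 1.0]]
TOKENS = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

cache = KVCache()
cached_outputs = [decode_step(x, W_Q, W_K, W_V, cache) for x in TOKENS]
```

In this sketch each `decode_step` performs one key projection and one value projection, while `recompute_last` repeats them for every earlier token. Both paths give the same attention output, which is why the cache trades memory for speed without changing results. Nothing here stores or returns a finished response, which is what separates it from response caching.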
