After the model emits logits over the vocabulary, the decoding step chooses the next token. Decoding settings determine how creative, how deterministic, and how fast the output is.
Common parameters
| Parameter | Effect |
|---|---|
| temperature | Higher → more random, diverse outputs |
| top_p (nucleus) | Sample from smallest set whose cumulative prob ≥ p |
| max_tokens | Cap completion length and cost |
| stop sequences | End generation early for structured pipelines |
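To make the first two parameters concrete, here is a minimal sketch of temperature scaling and nucleus filtering in plain Python. The function names `softmax_with_temperature` and `top_p_filter` and the toy logits are illustrative assumptions, not any particular library's API; real inference stacks do this on tensors.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Turn logits into probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0; treat 0 as greedy argmax")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, p=0.9):
    """Keep the smallest set of tokens whose cumulative probability >= p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    mass = sum(probs[i] for i in kept)
    # Renormalize so the surviving tokens form a valid distribution.
    return {i: probs[i] / mass for i in kept}

logits = [2.0, 1.0, 0.5, -1.0]  # toy values for a 4-token vocabulary
sharp = softmax_with_temperature(logits, 0.5)  # low temperature: peaked
flat = softmax_with_temperature(logits, 2.0)   # high temperature: flatter
nucleus = top_p_filter(softmax_with_temperature(logits, 1.0), p=0.8)
```

With these toy logits, the low-temperature distribution puts more mass on the top token than the high-temperature one, and the nucleus at p=0.8 keeps only the two most probable tokens.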
Greedy vs sampling
Greedy decoding (temperature 0) always picks the highest-probability token, which suits JSON extraction and tests. Sampling helps when brainstorming copy but hurts reproducibility unless you fix a seed where the API supports one.
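The difference can be sketched on a toy distribution. This is an illustration under stated assumptions: `greedy_pick`, `sample_pick`, and the probabilities are made up for the example, and the seed is handled by Python's own `random.Random` rather than by a model API.

```python
import random

def greedy_pick(probs):
    """Return the index of the most probable token."""
    return max(range(len(probs)), key=lambda i: probs[i])

def sample_pick(probs, rng):
    """Draw one token index at random, weighted by probability."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

probs = [0.6, 0.25, 0.1, 0.05]  # toy next-token distribution

# Greedy gives the same answer on every call.
greedy_runs = [greedy_pick(probs) for _ in range(5)]

# Sampling varies from draw to draw, but two generators created
# with the same seed reproduce the same sequence of draws.
rng_a = random.Random(42)
rng_b = random.Random(42)
run_a = [sample_pick(probs, rng_a) for _ in range(20)]
run_b = [sample_pick(probs, rng_b) for _ in range(20)]
```

Here `greedy_runs` is always index 0, and `run_a` equals `run_b` because both generators start from the same seed. Whether a hosted model exposes a comparable seed setting depends on the provider.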
Latency tips
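Streaming tokens to the UI improves perceived speed: the user sees the first tokens while the rest are still being generated, even though total generation time is unchanged. Compute embeddings in offline batches, and keep chat paths warm with connection pooling where the SDK allows.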
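The effect of streaming on perceived speed can be shown with a simulated stream. In this sketch `fake_token_stream` is a hypothetical stand-in for a streaming API, and the per-token delay is an arbitrary assumption; the point is only that time to first token is much shorter than total time.

```python
import time

def fake_token_stream(tokens, delay_s=0.01):
    """Hypothetical stand-in for a streaming API: yields one token at a time."""
    for token in tokens:
        time.sleep(delay_s)  # simulated per-token generation time
        yield token

def consume_stream(stream):
    """Collect tokens, recording time to first token and total time."""
    start = time.perf_counter()
    first_token_s = None
    pieces = []
    for token in stream:
        if first_token_s is None:
            first_token_s = time.perf_counter() - start
        pieces.append(token)  # a real UI would render the token here
    total_s = time.perf_counter() - start
    return "".join(pieces), first_token_s, total_s

text, first_token_s, total_s = consume_stream(
    fake_token_stream(["Hel", "lo", ",", " wor", "ld"])
)
```

A non-streaming call would make the user wait roughly `total_s` before seeing anything, while a streaming UI can start rendering after `first_token_s`.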
Important interview questions and answers
- Q: Should you use temperature=0 for unit tests?
A: Yes. Stable outputs make regression tests practical, though some hosted models can still vary slightly between runs.
Self-check
- What does temperature control?
- When is greedy decoding preferred?
Tip: use temperature=0 for JSON extraction tests; raise it only for creative copy drafts.
Interview prep
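- What does temperature 0 do?
It gives greedy, near-deterministic decoding, which is good for tests and structured extraction.
- What does top_p do?
Nucleus sampling restricts the token pool to the most probable tokens while keeping some diversity.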