Context Length
Context length is how many tokens the model can read at once — bigger means it can process longer documents, but uses more RAM at inference time.
A 4K context window means the model can read up to about 3,000 English words at once; once the window is full, older content has to be dropped or truncated. On-device models in 2026 typically support 8K-128K context windows. Attention compute grows quadratically with sequence length and the KV cache grows linearly with it, which is why mobile models cap below cloud equivalents.
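One concrete part of the memory cost is the KV cache, which stores a key and a value vector per layer for every token held in context. The sketch below estimates its size with the standard formula; the model dimensions (28 layers, 8 KV heads, head dimension 128, 16-bit cache values) are hypothetical and chosen only for illustration.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Rough KV cache size in bytes for seq_len tokens of context."""
    # Factor of 2: one tensor for keys and one for values, in every layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


if __name__ == "__main__":
    # Hypothetical small model, not a real product's configuration.
    for ctx in (4_096, 32_768, 131_072):
        gib = kv_cache_bytes(ctx, n_layers=28, n_kv_heads=8, head_dim=128) / 2**30
        print(f"{ctx:>7} tokens -> {gib:.2f} GiB of KV cache")
```

Under these assumptions the cache grows in direct proportion to the context length, so a 128K window costs 32 times what a 4K window does, on top of the model weights themselves.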
See also: KV Cache, Attention, Token
Token
The basic unit of input and output for an LLM — usually a word piece, a punctuation mark, or a small sequence of bytes.
A tokenizer chops text into tokens before the model sees it. English averages roughly 0.75 words per token (about 1.3 tokens per word); Chinese and Japanese often average 1-2 tokens per character, because each character takes several bytes in UTF-8 and vocabularies trained mostly on English text contain fewer merged tokens for those scripts. Pricing for cloud LLMs is per token, and on-device throughput is reported in tokens per second. Most "context length" and "max output length" limits are measured in tokens, not characters.
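Part of the gap between scripts comes from the byte level that many tokenizers start from. The sketch below uses no real tokenizer; it only counts UTF-8 bytes per character, which is the worst case a byte-level vocabulary would face if it had learned no merges for that text. The sample strings and the helper name are illustrative.

```python
def utf8_bytes_per_char(text):
    """Average number of UTF-8 bytes used per character of text."""
    return len(text.encode("utf-8")) / len(text)


if __name__ == "__main__":
    samples = {
        "English": "hello world",
        "Chinese": "你好世界",
        "Japanese": "こんにちは",
    }
    for name, text in samples.items():
        print(f"{name:<9} {utf8_bytes_per_char(text):.1f} bytes per character")
```

ASCII text uses one byte per character, while these CJK characters use three each, so a tokenizer has to learn dedicated merges before it can represent them in a single token.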
See also: Context Length, Throughput, Embedding
Throughput
Tokens per second produced during generation; the headline speed metric for on-device LLMs after the first token is out.
Useful reference points: human reading speed is roughly 5-10 tok/s; comfortable streaming chat needs ~15+ tok/s. As of 2026, a 3B 4-bit model on an iPhone 15 Pro typically generates 25-40 tok/s; on an M4 Pro the same model exceeds 100 tok/s. Generation throughput is bounded almost entirely by memory bandwidth, because every new token requires reading the model weights from memory once; prompt prefill, by contrast, is bounded by raw compute.
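The memory-bandwidth bound can be turned into a back-of-the-envelope ceiling: if each generated token requires streaming all the weights once, tokens per second cannot exceed bandwidth divided by model size. The sketch below assumes exactly that and ignores KV cache reads and other overhead; the bandwidth figures are illustrative placeholders, not measured device specifications.

```python
def decode_tokens_per_second(bandwidth_gb_s, n_params_billion, bits_per_weight):
    """Upper bound on generation speed when decoding is bandwidth-bound."""
    # Weight size in GB: billions of parameters times bytes per parameter.
    model_gb = n_params_billion * bits_per_weight / 8
    return bandwidth_gb_s / model_gb


if __name__ == "__main__":
    # Hypothetical bandwidths for a phone-class and a laptop-class chip.
    for label, bandwidth in (("phone-class", 50), ("laptop-class", 250)):
        ceiling = decode_tokens_per_second(bandwidth, n_params_billion=3, bits_per_weight=4)
        print(f"{label:<12} {bandwidth:>4} GB/s -> at most {ceiling:.0f} tok/s")
```

Real throughput lands below this ceiling, but the estimate explains why quantizing to fewer bits per weight tends to speed up generation roughly in proportion.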
See also: Latency (Time to First Token), Memory Bandwidth, Token
Latency (Time to First Token)
How long the user waits before the first generated token appears, mostly determined by prompt length and prefill compute speed.
Time-to-first-token (TTFT) and throughput are different metrics: TTFT covers prompt prefill (computing the KV cache for the entire input), while throughput governs the streaming phase after that. A 4K-token prompt may take 1-2 seconds before the first response token appears even on fast hardware. UX implication: keep system prompts short, and hide latency behind streaming animations or "thinking..." indicators.
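A simple way to see how the two metrics combine is to model a response as a prefill phase followed by a decode phase. The sketch below assumes constant prefill and decode rates, which real hardware only approximates; the rates in the example are hypothetical.

```python
def response_time_s(prompt_tokens, output_tokens, prefill_tok_s, decode_tok_s):
    """Return (time to first token, total response time) in seconds."""
    # Prefill: the whole prompt is processed before any output appears.
    ttft = prompt_tokens / prefill_tok_s
    # Decode: output tokens then stream at the generation throughput.
    total = ttft + output_tokens / decode_tok_s
    return ttft, total


if __name__ == "__main__":
    # Hypothetical device: 3,000 tok/s prefill, 30 tok/s generation.
    ttft, total = response_time_s(4_096, 200, prefill_tok_s=3_000, decode_tok_s=30)
    print(f"TTFT {ttft:.2f} s, full response after {total:.2f} s")
```

Halving the prompt halves TTFT in this model but leaves the streaming rate unchanged, which is why trimming system prompts helps perceived responsiveness most.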
See also: Throughput, Context Length, KV Cache
Temperature
A sampling parameter that controls randomness: low values make the model more deterministic and focused, high values make it more creative and varied.
Mathematically, temperature divides the model's logits before softmax: lower values sharpen the probability distribution, higher values flatten it. T=0 cannot be applied literally (it would divide by zero), so it is treated by convention as always picking the top token (greedy, deterministic); T=1.0 is the model's natural distribution; T=1.5+ injects significant randomness. Practical guidance: use 0.0-0.3 for translation, summarization, and structured output; 0.7-1.0 for creative writing and brainstorming.
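The effect of dividing logits by the temperature can be checked with a few lines of plain Python. This is a minimal sketch rather than any particular library's sampler; the function name and the example logits are made up for illustration, and T=0 is handled as a greedy special case.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities after scaling by 1 / temperature."""
    if temperature <= 0:
        # Greedy convention: all probability on the highest logit.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    logits = [2.0, 1.0, 0.0]  # illustrative values for three candidate tokens
    for t in (0.0, 0.5, 1.0, 1.5):
        probs = softmax_with_temperature(logits, t)
        print(f"T={t}: " + ", ".join(f"{p:.2f}" for p in probs))
```

The top token's share shrinks as the temperature rises, and the other candidates gain probability mass.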
See also: Top-p (Nucleus Sampling), Token
Top-p (Nucleus Sampling)
A sampling cutoff that picks the next token only from the smallest set of candidates whose probabilities sum to p (e.g., 0.9).
Top-p sampling adapts to the model's confidence: when the model is sure, the nucleus might contain only 2-3 tokens; when it's uncertain, the nucleus expands to dozens. This is usually preferred over a fixed top-k because it stays sharp on factual answers and stays diverse on open-ended prompts. Common pairing: temperature 0.7 + top-p 0.9 as a balanced creative default.
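The adaptive behavior is easy to see in a small sketch of the nucleus filter. This is an illustrative implementation over a plain list of probabilities, not a specific library's API; the function name and the two example distributions are invented for the demonstration.

```python
def top_p_filter(probs, p=0.9):
    """Keep the smallest set of top tokens whose probabilities sum to >= p.

    Returns a dict mapping token index to its renormalized probability.
    """
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}


if __name__ == "__main__":
    confident = [0.85, 0.10, 0.03, 0.02]  # model is fairly sure
    uncertain = [0.25, 0.25, 0.25, 0.25]  # model has no clear favorite
    print("confident nucleus size:", len(top_p_filter(confident)))
    print("uncertain nucleus size:", len(top_p_filter(uncertain)))
```

A fixed top-k of 2 would behave the same on the first distribution but would discard half of the equally plausible candidates in the second.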
See also: Temperature, Token