On-Device AI Glossary: Quantization, KV Cache, NPU Explained

Optimization

Quantization

Quantization shrinks a model by storing weights as lower-precision numbers (e.g., 4-bit instead of 16-bit), trading minor accuracy loss for much smaller size.

Quantization is the most important reason on-device AI is now feasible. A 7B-parameter model in FP16 is ~14 GB; quantized to 4-bit it shrinks to ~4 GB and runs on consumer phones with minimal quality loss for most use cases. Common formats include GGUF Q4_K_M, AWQ, and GPTQ.
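The size arithmetic and the basic rounding step can be sketched in a few lines. This is a minimal illustration of symmetric 4-bit quantization for one group of weights, not the exact scheme used by GGUF Q4_K_M, AWQ, or GPTQ. The function names and the 4.5 bits-per-weight figure (4-bit values plus an assumed overhead for per-group scales) are assumptions made for the example.

```python
def model_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), ignoring metadata."""
    return n_params * bits_per_weight / 8 / 1e9


def quantize_int4(weights: list[float]) -> tuple[list[int], float]:
    """Symmetric 4-bit quantization of one group: integers in [-8, 7] plus one scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    quantized = [max(-8, min(7, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: list[int], scale: float) -> list[float]:
    """Recover approximate float weights from the integers and the shared scale."""
    return [v * scale for v in quantized]


if __name__ == "__main__":
    print(model_size_gb(7e9, 16))   # FP16 weights
    print(model_size_gb(7e9, 4.5))  # 4-bit weights plus assumed scale overhead
    q, s = quantize_int4([0.12, -0.70, 0.33, 0.05])
    print(q, dequantize(q, s))
```

The rounding error of each weight is at most half the scale, which is why smaller groups (each with its own scale) generally preserve more accuracy at the cost of a little extra storage.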

Pruning

Removing weights or whole neurons that contribute little to outputs, shrinking the model with minimal quality loss.

Pruning comes in two flavors: unstructured (zeroing individual weights, hard to accelerate on real hardware) and structured (removing whole channels or attention heads, which actually speeds up inference). Modern on-device models often combine 2:4 structured sparsity (NVIDIA-style) with quantization. Typical results: 30-50% size reduction with under 2 percentage points of benchmark loss.
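The 2:4 pattern means that in every group of four consecutive weights, at most two are nonzero. A minimal sketch, assuming simple magnitude-based selection (real pipelines usually fine-tune after pruning to recover accuracy); the function name is hypothetical:

```python
def prune_2_of_4(weights: list[float]) -> list[float]:
    """Zero the two smallest-magnitude values in every consecutive group of four."""
    if len(weights) % 4 != 0:
        raise ValueError("length must be a multiple of 4")
    pruned: list[float] = []
    for start in range(0, len(weights), 4):
        group = weights[start:start + 4]
        # Keep the indices of the two largest magnitudes in this group.
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        pruned.extend(w if i in keep else 0.0 for i, w in enumerate(group))
    return pruned


if __name__ == "__main__":
    print(prune_2_of_4([0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.4, 0.1]))
```

Because the zeros follow a fixed pattern, hardware can skip them predictably, which is what separates structured sparsity from zeroing arbitrary individual weights.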

See also: Quantization, Knowledge Distillation

LoRA (Low-Rank Adaptation)

A fine-tuning method that trains tiny low-rank matrices on top of frozen weights, producing per-task adapters that ship as small extra files.

A full fine-tune of a 7B model needs on the order of 100 GB or more of GPU memory once gradients and optimizer state are counted; LoRA reduces trainable parameters by 1000-10000× by only learning rank-r updates (typically r=8 or 16). On-device implication: one base model can host many swappable LoRA adapters (each 5-50 MB), so a phone app can keep "translation mode" and "summarization mode" weights without storing two full models.
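To make the rank-r update concrete: for a frozen weight matrix W, LoRA learns two small matrices A (r × d_in) and B (d_out × r) and computes y = Wx + (alpha / r) · B(Ax). The sketch below uses plain Python lists; the function names and the 4096 × 4096 layer size are assumptions for illustration.

```python
def lora_param_counts(d_in: int, d_out: int, rank: int) -> tuple[int, int]:
    """Trainable parameters for one matrix: full fine-tune vs. LoRA adapter."""
    return d_in * d_out, rank * (d_in + d_out)


def matvec(matrix: list[list[float]], vector: list[float]) -> list[float]:
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha: float, rank: int) -> list[float]:
    """y = W x + (alpha / rank) * B (A x); W stays frozen, only A and B are trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]


if __name__ == "__main__":
    full, lora = lora_param_counts(4096, 4096, 8)
    print(full, lora, full // lora)
```

For a single 4096 × 4096 matrix at r=8 the per-matrix reduction is a few hundred times; the larger overall ratios come from adapting only a subset of the model's matrices.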

See also: Quantization, Knowledge Distillation

GGUF (GGML Universal File)

A single-file model format that bundles weights, tokenizer, and metadata, designed for fast memory-mapped loading by llama.cpp and friends.

GGUF replaced the older GGML format in 2023 and became the default in the open-source on-device community. Key features: multiple quantization levels in the same family (Q2_K through Q8_0), built-in chat template, and an mmap-friendly layout that lets a 4 GB model be mapped in milliseconds, with pages read from storage on demand. Most "download a model" UX in local-inference tools today (LM Studio, llama.cpp, Ollama) ultimately consumes GGUF files.
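As an illustration of the single-file layout, the fixed-size header can be read with the standard library alone. This sketch assumes the version 2/3 header described in the ggml GGUF specification (little-endian: 4-byte magic "GGUF", uint32 version, uint64 tensor count, uint64 metadata key/value count); the function name is hypothetical, and parsing the metadata and tensor tables that follow is left out.

```python
import struct

HEADER_FORMAT = "<4sIQQ"  # magic, version, tensor_count, metadata_kv_count
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)  # 24 bytes


def read_gguf_header(data: bytes) -> dict:
    """Parse the fixed-size GGUF header from the first bytes of a file."""
    if len(data) < HEADER_SIZE:
        raise ValueError("not enough bytes for a GGUF header")
    magic, version, tensor_count, kv_count = struct.unpack_from(HEADER_FORMAT, data, 0)
    if magic != b"GGUF":
        raise ValueError("not a GGUF file")
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }


# Usage with a real file (the path is a placeholder):
# with open("model.gguf", "rb") as f:
#     print(read_gguf_header(f.read(HEADER_SIZE)))
```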

AWQ (Activation-aware Weight Quantization)

A 4-bit quantization method that protects the small fraction of weights tied to large activations, preserving more accuracy than naive rounding.

AWQ's key insight: only about 1% of weights matter disproportionately, and they can be identified by looking at activation statistics from a small calibration set. Rather than keeping those weights in higher precision, which would require a hardware-unfriendly mixed-precision layout, AWQ scales the corresponding weight channels up before quantization so their rounding error shrinks; all weights are then stored in 4-bit. Compared to GPTQ, AWQ tends to be faster at inference and slightly more robust on instruction-following benchmarks. Common pairing: AWQ-quantized weights inside a GGUF or safetensors container.

Deployment

On-device AI

On-device AI runs the model directly on your phone or laptop, with no cloud roundtrip — preserving privacy, working offline, and offering low latency.

On-device AI is the architectural choice that defines apps like Cove. The trade-off vs cloud: smaller usable model size (constrained by phone RAM/storage), but instant response, full privacy of inputs, and zero per-request cost. As of 2026, models in the 2-4B parameter range with 4-bit quantization run comfortably on flagship phones.

Edge AI

A broader category than on-device that includes phones, laptops, IoT sensors, in-vehicle compute, and edge servers — anywhere outside a central cloud.

Edge AI and on-device AI are often used interchangeably, but edge is the umbrella term: a security camera doing object detection on a Raspberry Pi, an anomaly-detection model for factory sensors running in a 5G base station, and a phone running an LLM are all edge AI; only the last is strictly on-device. The design pressures are the same in each case (local compute, limited memory, latency sensitivity), which is why most on-device techniques transfer directly to edge deployments.

Federated Learning

A training paradigm where many devices collaboratively improve a shared model by sending only weight updates, never raw user data, to a central server.

Federated learning eases the tension between needing data to improve models and respecting privacy. Famous deployment: Gboard's next-word prediction trains across hundreds of millions of phones without any keystroke leaving the device. Combined with secure aggregation and differential privacy, the weight updates can be made to reveal very little about any individual. On-device inference + federated learning forms a complete privacy-preserving ML loop.
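The server-side step can be sketched as a weighted average of client updates, in the style of federated averaging. This is a simplified illustration that omits secure aggregation and differential privacy; the function name and the (example count, weight delta) input shape are assumptions.

```python
def federated_average(
    global_weights: list[float],
    client_updates: list[tuple[int, list[float]]],
) -> list[float]:
    """Apply the example-weighted mean of client weight deltas to the global model.

    client_updates holds (num_local_examples, weight_delta) pairs; raw user
    data never appears here, only the deltas computed on each device.
    """
    total_examples = sum(n for n, _ in client_updates)
    if total_examples == 0:
        return list(global_weights)
    averaged = [0.0] * len(global_weights)
    for n, delta in client_updates:
        for i, d in enumerate(delta):
            averaged[i] += (n / total_examples) * d
    return [w + a for w, a in zip(global_weights, averaged)]


if __name__ == "__main__":
    updates = [(1, [1.0, 0.0]), (3, [-1.0, 4.0])]
    print(federated_average([1.0, 2.0], updates))
```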

See also: On-device AI, Private by Default, Edge AI

Private by Default

A design philosophy where user data stays on-device unless the user explicitly opts in to send it elsewhere — privacy as the starting state.

Private by default flips the dominant cloud-AI model where every input is uploaded by default. Apple Intelligence, the Cove apps, and many recent on-device products commit to this stance: photo, voice, health, and translation inputs never leave the phone. The technical enablers are on-device inference, federated learning for improvement loops, and clear separation of any opt-in features. Marketing-wise, "your data never leaves your phone" is becoming a competitive moat against cloud-only competitors.

See also: On-device AI, Federated Learning, Edge AI

LLM vs SLM

Large Language Models (often 70B+ params, cloud-only) vs Small Language Models (typically under 8B, designed to run on-device).

The line is fuzzy and shifting. As of 2026, "SLM" usually means models under 8B parameters tuned to run within 4-8 GB of phone RAM after quantization (Phi, the smaller Gemma 3 models, MiniCPM, Llama 3.2 1B/3B). LLMs in the 70B-1T range still live in the cloud. The interesting middle ground, 13-30B models, runs comfortably on M-series Macs but not phones, creating a "personal cloud" tier that some products use as a privacy-friendly backstop.

Multimodal

Models that accept and reason over more than one input type — typically text plus images, audio, or video — within a single architecture.

Multimodal LLMs typically attach a vision encoder (like a small ViT) and/or audio encoder to a language model, projecting their outputs into the same embedding space as text tokens. On-device examples in 2026 include Gemma 4 multimodal, Apple Foundation Models with vision, and Phi-4-multimodal. Cove apps lean heavily on this: Cove Photo describes images, Cove Voice transcribes and summarizes, Cove Travel reads signs from camera input — all from a single multimodal model.

See also: Embedding, Transformer, Inference Runtime

Inference Runtime

The library or engine that actually executes a model on a device — examples include LiteRT, MediaPipe, ExecuTorch, Core ML, and llama.cpp.

Runtimes handle quantization formats, memory mapping, NPU/GPU dispatch, KV cache management, and the locking that serializes concurrent inference requests. Choice matters: Core ML and Apple Foundation Models are the right path on iOS for ANE access; LiteRT and MediaPipe dominate Android with Hexagon/Tensor support; ExecuTorch (PyTorch Edge) is gaining cross-platform traction; llama.cpp remains the open-source default for GGUF models. The Cove apps depend on LiteRT-LM via the InferenceEngine wrapper.

Inference

Context Length

Context length is how many tokens the model can read at once — bigger means it can process longer documents, but uses more RAM at inference time.

A 4K context window means the model can read up to about 3,000 English words at once before older content is dropped. On-device models in 2026 typically support 8K-128K context windows. Attention compute grows quadratically with context length and the KV cache grows linearly, which is why mobile models cap below cloud equivalents.

See also: KV Cache, Attention, Token

Token

The basic unit of input and output for an LLM — usually a word piece, a punctuation mark, or a small sequence of bytes.

A tokenizer chops text into tokens before the model sees it. English averages roughly 0.75 words per token (about 1.3 tokens per word); Chinese and Japanese often average 1-2 tokens per character, typically because vocabularies trained mostly on English text split multi-byte UTF-8 characters into more pieces. Pricing for cloud LLMs is per token, and on-device throughput is reported in tokens per second. Most "context length" and "max output length" limits are measured in tokens, not characters.

See also: Context Length, Throughput, Embedding

Throughput

Tokens per second produced during generation; the headline speed metric for on-device LLMs after the first token is out.

Useful reference points: human reading speed is roughly 5-10 tok/s; comfortable streaming chat needs ~15+ tok/s. As of 2026, a 3B 4-bit model on an iPhone 15 Pro typically generates 25-40 tok/s; on M4 Pro the same model exceeds 100 tok/s. Throughput is bounded almost entirely by memory bandwidth in the autoregressive phase, and by raw compute during prompt prefill.

Latency (Time to First Token)

How long the user waits before the first generated token appears, mostly determined by prompt length and prefill compute speed.

Time-to-first-token (TTFT) and throughput are different metrics — TTFT covers prompt prefill (computing KV cache for the entire input), while throughput governs the streaming phase after that. A 4K-token prompt may take 1-2 seconds before the first response token appears even on fast hardware. UX implication: keep system prompts short, and hide latency behind streaming animations or "thinking..." indicators.

See also: Throughput, Context Length, KV Cache

Temperature

A sampling parameter that controls randomness — low values make the model deterministic and focused, high values make it creative and varied.

Mathematically, temperature divides the model's logits before softmax — lower values sharpen the probability distribution, higher values flatten it. T=0 means always pick the top token (deterministic); T=1.0 is the model's natural distribution; T=1.5+ injects significant randomness. Practical guidance: use 0.0-0.3 for translation, summarization, and structured output; 0.7-1.0 for creative writing and brainstorming.
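The effect of dividing logits by the temperature can be seen in a minimal sketch. The function name is hypothetical, and treating T=0 as a one-hot distribution on the top token is a convention chosen here to avoid dividing by zero.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Convert logits to probabilities after dividing by the temperature."""
    if temperature <= 0:
        # T=0 is handled as greedy decoding: all probability on the top token.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    for t in (0.3, 1.0, 1.5):
        print(t, softmax_with_temperature([2.0, 1.0, 0.0], t))
```

Running it shows the top token's probability rising as the temperature drops and the distribution flattening as it rises.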

See also: Top-p (Nucleus Sampling), Token

Top-p (Nucleus Sampling)

A sampling cutoff that picks the next token only from the smallest set of candidates whose probabilities sum to p (e.g., 0.9).

Top-p sampling adapts to the model's confidence: when the model is sure, the nucleus might contain only 2-3 tokens; when it's uncertain, the nucleus expands to dozens. This is usually preferred over a fixed top-k because it stays sharp on factual answers and stays diverse on open-ended prompts. Common pairing: temperature 0.7 + top-p 0.9 as a balanced creative default.
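A minimal sketch of the nucleus selection, assuming the input is already a probability distribution over token indices; the function names are hypothetical. Sampling within the nucleus uses `random.Random.choices`, which renormalizes the kept probabilities implicitly.

```python
import random


def nucleus(probs: list[float], p: float) -> list[int]:
    """Indices of the smallest set of tokens whose probabilities sum to at least p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen: list[int] = []
    cumulative = 0.0
    for i in order:
        chosen.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return chosen


def sample_top_p(probs: list[float], p: float, rng: random.Random) -> int:
    """Sample one token index from the nucleus, weighted by probability."""
    candidates = nucleus(probs, p)
    weights = [probs[i] for i in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    print(nucleus([0.97, 0.01, 0.01, 0.01], 0.9))   # confident model: tiny nucleus
    print(nucleus([0.5, 0.3, 0.15, 0.05], 0.9))     # less certain: larger nucleus
    print(sample_top_p([0.5, 0.3, 0.15, 0.05], 0.9, rng))
```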

See also: Temperature, Token

Architecture

KV Cache

KV cache stores intermediate attention computations during generation, dramatically speeding up token-by-token output but consuming significant RAM.

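Without a KV cache, each new token requires recomputing keys, values, and attention for every previous token, a per-step cost that grows quadratically with context length. With a KV cache, each step computes only the new token's key and value and attends over the stored ones, so per-step cost grows linearly. The cache size scales with context length and model dimensions, often becoming the dominant memory cost on-device for long contexts.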
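How the cache scales can be estimated with simple arithmetic: one key tensor and one value tensor per layer, each holding a vector per KV head per token. The function name and the example configuration (28 layers, 8 KV heads, head dimension 128, FP16 cache) are assumptions for illustration, not the specification of any particular model.

```python
def kv_cache_bytes(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    seq_len: int,
    bytes_per_value: int = 2,  # FP16 cache entries
) -> int:
    """Approximate KV cache size for one sequence."""
    # The leading 2 accounts for storing both keys and values.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


if __name__ == "__main__":
    for tokens in (2_048, 8_192, 32_768):
        size = kv_cache_bytes(28, 8, 128, tokens)
        print(tokens, round(size / 1e9, 2), "GB")
```

The size grows linearly with context length, so quadrupling the context quadruples the cache. Reducing the number of KV heads (as Grouped Query Attention does) or quantizing cache entries shrinks it proportionally.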

Transformer

The neural network architecture that powers virtually every modern LLM, built around self-attention layers stacked into deep blocks.

Introduced in 2017 by Google ("Attention Is All You Need"), the Transformer replaced earlier RNN/LSTM designs by processing tokens in parallel rather than sequentially. A typical on-device LLM stacks 24-40 transformer blocks; each block contains multi-head attention and a feed-forward network. Most efficiency work for on-device AI (KV cache, quantization, MoE) targets transformer internals.

Attention

The core mechanism that lets a transformer weigh which earlier tokens matter most when predicting the next token in a sequence.

Attention computes a weighted sum over all previous tokens, where the weights are a softmax over scaled dot products of learned query and key projections. Modern LLMs use multi-head attention (typically 16-32 heads) so different heads can specialize on different relationships. Attention is the most compute- and memory-hungry part of inference, which is why optimizations like Flash Attention, Grouped Query Attention, and KV caching are central to on-device performance.
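For a single query vector, scaled dot-product attention can be written in a few lines. This is a single-head sketch in plain Python with hypothetical names; real implementations batch the computation across heads and positions.

```python
import math


def attention(
    query: list[float],
    keys: list[list[float]],
    values: list[list[float]],
) -> tuple[list[float], list[float]]:
    """Return (output vector, attention weights) for one query over earlier tokens."""
    d = len(query)
    # Similarity of the query to each key, scaled by sqrt(d).
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the value vectors.
    out_dim = len(values[0])
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(out_dim)]
    return output, weights


if __name__ == "__main__":
    out, w = attention([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], [[10.0, 0.0], [0.0, 10.0]])
    print(w, out)
```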

See also: Transformer, KV Cache, Context Length

MoE (Mixture of Experts)

An architecture that routes each token to only a few of many specialized expert sub-networks, giving large total capacity at low active compute.

A typical MoE model might have 8-64 experts but activate only 2 per token. Mixtral 8x7B, for example, has about 47B total parameters but uses roughly 13B per token, so it runs with the compute footprint of a ~13B dense model. Examples include Mixtral, DeepSeek-MoE, and parts of Gemma 4. On-device the trade-off shifts: total parameters still need to fit in RAM, so MoE is only useful on phones if combined with aggressive quantization or expert offloading.
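Top-k routing can be sketched as follows: a router scores every expert, only the k best are run, and their outputs are mixed using softmax weights over the selected scores. The function names are hypothetical, experts are modeled as plain callables that keep the input dimension, and load-balancing terms used in real training are omitted.

```python
import math
from typing import Callable

Expert = Callable[[list[float]], list[float]]


def route_top_k(router_logits: list[float], k: int = 2) -> dict[int, float]:
    """Pick the k highest-scoring experts and softmax-normalize their gate weights."""
    top = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)[:k]
    peak = max(router_logits[i] for i in top)
    exps = {i: math.exp(router_logits[i] - peak) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


def moe_forward(
    x: list[float],
    experts: list[Expert],
    router_logits: list[float],
    k: int = 2,
) -> list[float]:
    """Run only the selected experts and combine their outputs by gate weight."""
    gates = route_top_k(router_logits, k)
    output = [0.0] * len(x)
    for index, gate in gates.items():
        expert_out = experts[index](x)  # unselected experts are never evaluated
        output = [o + gate * v for o, v in zip(output, expert_out)]
    return output


if __name__ == "__main__":
    experts = [lambda x, s=s: [v * s for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
    print(moe_forward([1.0, 2.0], experts, [0.1, 2.0, -1.0, 2.0]))
```

Only k experts run per token, which is where the compute saving comes from; all expert weights still have to be resident (or offloaded) somewhere, which is the RAM cost noted above.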

See also: Transformer, Quantization, LLM vs SLM

Knowledge Distillation

Training a small student model to imitate a much larger teacher model, transferring most of its capability into a fraction of the size.

Distillation is how most on-device models get their unusually strong quality-to-size ratio. The student learns from the teacher's output probabilities (soft labels) rather than just final answers, capturing nuance the teacher considered. Gemma 3, Phi, and MiniCPM all rely heavily on distillation. The result: a 3B distilled model often beats a 7B model trained from scratch on the same data.
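Learning from soft labels is commonly expressed as a KL divergence between the teacher's and the student's temperature-softened distributions. The sketch below follows that common formulation, including the conventional T² scaling; the function names and the default temperature of 2.0 are assumptions, and real training usually mixes this term with an ordinary cross-entropy loss on the true labels.

```python
import math


def softmax(logits: list[float], temperature: float = 1.0) -> list[float]:
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(
    student_logits: list[float],
    teacher_logits: list[float],
    temperature: float = 2.0,
) -> float:
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    kl = sum(p * math.log(p / q) for p, q in zip(teacher, student) if p > 0)
    return kl * temperature ** 2


if __name__ == "__main__":
    teacher = [4.0, 2.5, 0.5]
    print(distillation_loss([4.0, 2.5, 0.5], teacher))  # student matches teacher
    print(distillation_loss([4.0, 0.0, 0.0], teacher))  # student ignores runner-up
```

The second student gets the top answer right but is still penalized for missing how plausible the teacher found the runner-up, which is the "nuance" soft labels carry.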

See also: LLM vs SLM, Quantization, Transformer

Embedding

A vector representation of a token (or a sentence) where semantically similar items end up close together in high-dimensional space.

Every transformer starts by mapping each input token to an embedding vector (typically 1024-4096 dimensions). The same idea powers semantic search and RAG: encode documents and queries into the same space, then find nearest neighbors by cosine similarity. On-device embedding models (e.g., MiniLM, GTE-small) are tiny — under 100 MB — making local semantic search practical on phones.
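Nearest-neighbor lookup by cosine similarity can be sketched with toy vectors. The three-dimensional embeddings and document names below are made up for illustration; a real embedding model would produce vectors with hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest(query: list[float], documents: dict[str, list[float]]) -> list[str]:
    """Document names ordered from most to least similar to the query."""
    return sorted(
        documents,
        key=lambda name: cosine_similarity(query, documents[name]),
        reverse=True,
    )


if __name__ == "__main__":
    docs = {
        "cat care guide": [1.0, 0.1, 0.0],
        "tax filing tips": [0.0, 0.2, 1.0],
    }
    print(nearest([0.9, 0.2, 0.1], docs))
```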

See also: Transformer, Token, Attention

Hardware

NPU (Neural Processing Unit)

A dedicated chip optimized for running neural networks. Modern phones ship with NPUs (Apple ANE, Google Tensor, Qualcomm Hexagon) for fast, low-power AI.

NPUs run inference 5-10× faster than CPUs and are 2-3× more power-efficient than GPUs for AI workloads. The Apple Neural Engine (16 cores in A17 Pro), Google Tensor TPU, and Qualcomm Hexagon are the dominant mobile NPUs in 2026. Frameworks like Core ML, MediaPipe, and ONNX Runtime route tensor ops to NPUs automatically when supported.

Apple Neural Engine (ANE)

Apple's dedicated NPU built into every modern A-series and M-series chip, accessed through the Core ML framework on iOS and macOS.

The Neural Engine first shipped in the A11 (2017) with 2 cores; by A17 Pro it has 16 cores delivering ~35 TOPS. Crucially, ANE has its own dedicated SRAM and runs without contending for CPU/GPU resources, which means it can sustain ML workloads while the rest of the chip handles UI. iOS apps using the FoundationModels framework or Core ML automatically route eligible ops to the ANE.

Tensor Core

NVIDIA's matrix-multiply hardware unit found in Tegra mobile chips and desktop GPUs; massively accelerates the dense matmul behind transformers.

A Tensor Core executes a small matrix multiply (e.g., 4×4 × 4×4 → 4×4) in a single cycle, the operation that dominates LLM inference. Mobile relevance: NVIDIA Tegra Orin (Switch successor, automotive, robotics) ships dozens of Tensor Cores, making it one of the strongest mobile-grade AI platforms outside of phones. Depending on the GPU generation, Tensor Cores natively support FP16, BF16, INT8, and (on newer architectures) FP8, the formats produced by quantization workflows.

Qualcomm Hexagon

Qualcomm's combined NPU/DSP found in Snapdragon mobile and PC chips, accessed via the QNN SDK and Snapdragon AI Engine.

Hexagon evolved from a DSP for audio/imaging into a full neural accelerator. Snapdragon 8 Gen 3 and X Elite ship with Hexagon NPUs delivering up to about 45 TOPS, putting them in the same league as Apple's ANE for on-device LLMs. Hexagon is the dominant Android-side mobile NPU; cross-platform LLM apps targeting Android typically rely on it via LiteRT (formerly TensorFlow Lite), ONNX Runtime, or Qualcomm's QNN.

RAM vs VRAM

On desktop GPUs, model weights must fit in dedicated VRAM separate from system RAM. On phones, RAM is unified — same memory for CPU, GPU, and NPU.

On a desktop with an RTX 4090, the 24 GB of VRAM is separate from system RAM; loading a model into it requires copying across PCIe. On phones (and Apple Silicon), memory is unified: the same physical chips serve CPU, GPU, and NPU. This is why a phone with 8 GB RAM can hold a 4 GB model without a separate copy into accelerator memory, although the model shares that RAM with the OS and other apps, and why iPhone Pro models with 8+ GB are the practical floor for serious on-device LLMs in 2026.

See also: Memory Bandwidth, On-device AI, Quantization

Memory Bandwidth

How fast model weights can stream from memory into compute units; for on-device LLMs this is usually the real bottleneck, not raw TOPS.

Generating one token requires reading every weight in the model, so a 4 GB quantized model at 30 tok/s needs 120 GB/s of bandwidth. iPhone 15 Pro tops out around 50 GB/s; M4 Pro reaches ~273 GB/s. This is why "TOPS" headlines overstate real performance: a chip with 100 TOPS but 30 GB/s of bandwidth will still be memory-bound on LLM inference. Quantization does double duty here: it shrinks the model and cuts the bytes read per token.
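The bandwidth arithmetic can be turned into a rough upper bound on generation speed. This is a back-of-the-envelope sketch that assumes every weight is read once per token and ignores KV cache reads, caching effects, and compute time; the function names are hypothetical, and the bandwidth figures are the ones quoted in this entry.

```python
def required_bandwidth_gb_s(model_gb: float, tokens_per_second: float) -> float:
    """Bandwidth needed to stream all weights once per generated token."""
    return model_gb * tokens_per_second


def max_tokens_per_second(model_gb: float, bandwidth_gb_s: float) -> float:
    """Upper bound on decode speed when generation is purely memory-bound."""
    return bandwidth_gb_s / model_gb


if __name__ == "__main__":
    print(required_bandwidth_gb_s(4.0, 30.0))   # bandwidth for 4 GB at 30 tok/s
    print(max_tokens_per_second(4.0, 50.0))     # ~50 GB/s phone-class memory
    print(max_tokens_per_second(4.0, 273.0))    # ~273 GB/s laptop-class memory
    print(max_tokens_per_second(2.0, 50.0))     # halving model size doubles the bound
```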

See also: RAM vs VRAM, Quantization, KV Cache, Throughput