Methodology
How we built this leaderboard. All 8 models are evaluated against the same dimensions — parameters, quantized size, context window, modality, license, and minimum device RAM — sourced from official model cards (Hugging Face, vendor blogs, official documentation) as of the last-reviewed date shown above. We do not run our own benchmarks; instead, we cross-reference 2-3 authoritative sources per data point and prefer the vendor's own claim where it conflicts with third-party reproduction. Numbers may diverge from your real-world experience by ±10-20% depending on quantization scheme (Q4_K_M, AWQ, GPTQ all behave differently), runtime (LiteRT, MediaPipe, ExecuTorch, llama.cpp, Core ML), and device thermal throttling. Each model card carries its own `lastReviewed` field; this page is refreshed every quarter. Conflicts and ambiguities are tracked in our open GitHub repo.