What is it?
MiniCPM-V 4.0 is the mobile-optimized member of the MiniCPM-V series from ModelBest and OpenBMB (the open-source community spun out of Tsinghua University), released in August 2025. The series targets a specific niche: vision-specialized multimodal models that ship as edge-deployable open weights. Unlike Gemma 4 and Qwen 3.5, which add vision as a secondary capability on top of a general-purpose LLM, MiniCPM-V is trained from the ground up for image understanding tasks. The 4.1 B-parameter model punches above its weight, beating GPT-4.1-mini on the OpenCompass vision benchmark while being a fraction of the size.
Core specs at a glance
(See spec card above — populated from structured data.)
What devices can run it?
At Q4 quantization, the 4.1 B model takes roughly 2.5 GB of storage and needs about 3-4 GB of RAM to leave headroom for the vision encoder. That covers Pixel 8 and newer, iPhone 15 Pro and newer, and most modern Android phones with 4 GB+ RAM. OpenBMB’s published benchmark device is the iPhone 16 Pro Max, which reaches 17.9 tokens/sec with under 2 s time-to-first-token. The vision encoder can be a memory bottleneck: mid-range phones may need more aggressive quantization or smaller image inputs to maintain throughput.
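The storage and speed figures above can be sanity-checked with back-of-the-envelope arithmetic. The sketch below is illustrative only: the function names are hypothetical, the 4.1 B parameter count, 17.9 tokens/sec, and 2 s time-to-first-token come from the text, and the 4 to 5 effective bits per weight is an assumption (4-bit formats usually store scaling metadata alongside the weights, so the real figure depends on the exact quantization scheme).

```python
def quantized_weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (10^9 bytes) for a quantized model."""
    total_bits = params_billion * 1e9 * bits_per_weight
    return total_bits / 8 / 1e9


def response_seconds(output_tokens: int, tokens_per_sec: float, ttft_sec: float) -> float:
    """Rough response time: time-to-first-token plus steady-state decoding."""
    return ttft_sec + output_tokens / tokens_per_sec


if __name__ == "__main__":
    # Assumption: 4.0 is the ideal 4-bit case, 5.0 a pessimistic bound that
    # includes per-block scales and any layers kept at higher precision.
    for bits in (4.0, 5.0):
        size = quantized_weight_gb(4.1, bits)
        print(f"{bits} bits/weight -> {size:.2f} GB")

    # A 100-token answer at the published iPhone 16 Pro Max figures,
    # treating the "under 2 s" time-to-first-token as a 2 s upper bound.
    print(f"100 tokens -> {response_seconds(100, 17.9, 2.0):.1f} s")
```

Under these assumptions the weights land between about 2.05 GB and 2.56 GB, which brackets the roughly 2.5 GB quoted above, and a 100-token answer takes a little under 8 seconds end to end.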
Strengths and limitations
Strengths. Specialized vision training pays off on benchmarks: MiniCPM-V 4.0 scores 69.0 on OpenCompass, beating GPT-4.1-mini (released April 2025) and matching the previous-generation MiniCPM-V 2.6 (8B) at half the parameter count. On-device performance is genuinely usable: 17.9 tokens/sec on iPhone 16 Pro Max with under 2 s time-to-first-token and no thermal throttling. OCR and document analysis are strong thanks to the LLaVA-UHD architecture, and the model leads on OCRBench. The active OpenBMB community ships frequent updates. (The aggressive 96× video compression, which packs 6 frames into 64 tokens, is a feature of MiniCPM-V 4.5’s new 3D-Resampler; see the FAQ above for the V 4.0 vs 4.5 split.)
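To make the 96× figure concrete, the arithmetic can be run backwards from the numbers quoted above. This is a minimal sketch using only the 6 frames, 64 tokens, and 96× ratio from the text; the function names are hypothetical, and the implied per-frame token count is a derived value, not a published specification.

```python
def implied_baseline_tokens(compressed_tokens: int, compression_ratio: int) -> int:
    """Token count the same input would need without compression."""
    return compressed_tokens * compression_ratio


def baseline_tokens_per_frame(compressed_tokens: int, compression_ratio: int, frames: int) -> int:
    """Uncompressed tokens attributable to each frame in the group."""
    return implied_baseline_tokens(compressed_tokens, compression_ratio) // frames


if __name__ == "__main__":
    total = implied_baseline_tokens(64, 96)
    per_frame = baseline_tokens_per_frame(64, 96, 6)
    print(f"uncompressed: {total} tokens for 6 frames ({per_frame} per frame)")
    print("compressed:   64 tokens for 6 frames")
```

Read this way, 64 tokens at 96× stand in for 6,144 uncompressed tokens, or 1,024 per frame, which is why the compression matters for fitting long videos into a limited context window.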
Limitations. The custom ModelBest license adds friction versus Apache 2.0 / MIT alternatives. The 32 K context window is small: a quarter of Gemma 4’s 128 K and roughly an eighth of Qwen 3.5’s 262 K. The model is also less general-purpose than its peers: MiniCPM-V is great at vision but not the best choice for pure-text chat or long-document reasoning. The MiniCPM-o variant adds voice but jumps to 9 B parameters.
When to choose it (and when not to)
Choose MiniCPM-V 4.0 if: vision is your primary axis of value (OCR, image Q&A, document understanding, video summarization); you need SOTA benchmark accuracy on a modest device; you can navigate the ModelBest license registration step.
Skip it if: you need text-dominant general-purpose chat (Gemma 4, Qwen 3.5, or Ministral 3B are better generalists); you need the simplest possible license (Apache 2.0 alternatives win); you need long-context support (Qwen 3.5 at 262 K is a different league); you need audio in the same model (MiniCPM-o 4.5 adds voice, but Gemma 4 and Phi-4-multimodal cover this in smaller packages).
How it compares to similar on-device models
The closest peers are Llama 3.2 Mobile (text-only, no vision) and Qwen 3.5 2B (also vision-capable, but more general-purpose). MiniCPM-V 4.0 differentiates itself by being purpose-trained for vision benchmarks rather than treating vision as an add-on. For a full side-by-side comparison, see the leaderboard.
In a real Cove app
Cove Photo uses Gemma 4 for image understanding because we need a single model that also handles text-heavy tasks like context summaries. MiniCPM-V 4.0 would be the model to pick if Cove Photo’s value were narrowly focused on visual accuracy, for example in an OCR-heavy receipt scanner or a museum artwork explainer app. The architectural insight from MiniCPM-V, that aggressive vision token compression preserves quality, has informed how Cove Photo handles long photo sequences in album mode.