MiniCPM-V 4.0: Vision-Specialized On-Device Multimodal Model

4.1B parameters, 2.5 GB quantized, 32K context, native vision focus — OpenBMB's mobile-optimized vision model that punches above its weight on OpenCompass.

Last reviewed: May 2026
Parameters: 4.1 B
Size (quantized): 2.5 GB
Context length: 32,768 tokens
Modality: text + vision
License: modelbest-terms
Min RAM: 4 GB
Version: MiniCPM-V 4.0 (4.1B)
Released: 2025-08

What is it?

MiniCPM-V 4.0 is the mobile-optimized member of the MiniCPM-V series from ModelBest and OpenBMB (the open-source community spun out of Tsinghua University), released in August 2025. The series targets a specific niche: vision-specialized multimodal models that ship as edge-deployable open weights. Unlike Gemma 4 and Qwen 3.5, which add vision as a secondary capability on top of a general-purpose LLM, MiniCPM-V is trained from the ground up for image understanding tasks — and the 4.1 B parameter variant punches above its weight, beating GPT-4.1-mini on the OpenCompass vision benchmark while being a fraction of the size.

Core specs at a glance

(See spec card above — populated from structured data.)

What devices can run it?

The 4.1 B variant at Q4 quantization takes roughly 2.5 GB of storage and needs about 3-4 GB of RAM headroom for the vision encoder. That covers Pixel 8 and newer, iPhone 15 Pro and newer (including the iPhone 16 Pro Max, OpenBMB’s published benchmark device, which hits 17.9 tokens/sec with under 2s time-to-first-token), and most modern Android phones with 4 GB+ RAM. Vision encoder execution can be a memory bottleneck, so mid-range phones may need more aggressive quantization or smaller image inputs to maintain throughput.
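The 2.5 GB figure is roughly what back-of-the-envelope arithmetic predicts. The sketch below estimates weight storage from parameter count and average bits per weight. The function name and the 4.0 and 4.9 bits-per-weight values are illustrative assumptions (the page does not say which Q4 scheme is used), and the estimate ignores the vision encoder's runtime activations and the KV cache.

```python
def quantized_weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight storage in decimal GB (1 GB = 1e9 bytes)."""
    # (params_billion * 1e9 weights) * bits / 8 bits-per-byte / 1e9 bytes-per-GB
    return params_billion * bits_per_weight / 8


# 4.1 B parameters comes from the spec card; both bit widths are assumptions.
lower = quantized_weight_gb(4.1, 4.0)  # pure 4-bit weights, no overhead
upper = quantized_weight_gb(4.1, 4.9)  # assumed average including scales/metadata

print(f"{lower:.2f} GB to {upper:.2f} GB")
```

Under these assumptions the weights land between about 2.05 GB and 2.51 GB, which is consistent with the quoted 2.5 GB once quantization metadata is counted.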

Strengths and limitations

Strengths. Specialized vision training pays off on benchmarks: MiniCPM-V 4.0 hits 69.0 on OpenCompass, beating GPT-4.1-mini (released April 2025) and matching the previous-generation MiniCPM-V 2.6 (8B) at roughly half the parameter count. On-device performance is genuinely usable: 17.9 tokens/sec on iPhone 16 Pro Max with under 2s time-to-first-token and no thermal throttling. OCR and document analysis are strong thanks to the LLaVA-UHD architecture, leading on OCRBench. The active OpenBMB community ships frequent updates. (The aggressive 6-frame-to-64-token 96× video compression is a feature of MiniCPM-V 4.5’s new 3D-Resampler; see the FAQ below for the V 4.0 vs 4.5 split.)

Limitations. The custom ModelBest license adds friction versus Apache 2.0 / MIT alternatives. The 32 K context is small: a quarter of Gemma 4’s 128 K and far less than Qwen 3.5’s 262 K. It is less general-purpose than its peers: MiniCPM-V is great at vision but not the best for pure-text chat or long-document reasoning. The MiniCPM-o variant adds voice but jumps to 9 B parameters.

When to choose it (and when not to)

Choose MiniCPM-V 4.0 if: vision is your primary axis of value (OCR, image Q&A, document understanding, video summarization); you need strong vision benchmark accuracy (69.0 on OpenCompass) on a phone-class device; you can navigate the ModelBest license registration step.

Skip it if: you need text-dominant general-purpose chat (Gemma 4, Qwen 3.5, or Ministral 3B are better generalists); you need the simplest possible license (Apache 2.0 alternatives win); you need long-context support (Qwen 3.5 at 262 K is a different league); you need audio in the same model (MiniCPM-o 4.5 adds voice, but Gemma 4 and Phi-4-multimodal cover this in smaller packages).

How it compares to similar on-device models

The closest peers are Llama 3.2 Mobile (text-only, no vision) and Qwen 3.5 2B (also vision, but more general-purpose). MiniCPM-V 4.0 differentiates itself by being purpose-trained for vision benchmarks rather than treating vision as an add-on. For a full side-by-side comparison, see the leaderboard.

In a real Cove app

Cove Photo uses Gemma 4 for image understanding because we need a single model that also handles text-heavy tasks like context summaries. MiniCPM-V 4.0 would be the model to pick if Cove Photo’s value were narrowly focused on visual accuracy — for example, an OCR-heavy receipt scanner or a museum-artwork-explainer app. The architectural insight from MiniCPM-V — that aggressive vision token compression preserves quality — has informed how Cove Photo handles long photo sequences in album mode.


FAQ

Why pick MiniCPM-V over Gemma 4 or Qwen 3.5 for vision?

MiniCPM-V is purpose-trained for vision tasks. The 4.0 variant beats GPT-4.1-mini on the OpenCompass vision benchmark despite being a fraction of the size. Gemma 4 and Qwen 3.5 add vision as a secondary capability; MiniCPM-V is the dedicated vision model in this comparison set.

What's the difference between MiniCPM-V 4.0 and 4.5?

Both are vision-focused. The 4.0 (4.1B parameters) is mobile-optimized — fits in 2.5 GB at Q4 and runs on 4 GB-RAM phones. The 4.5 (8B) scores higher on OpenCompass (77.0, beating Qwen2.5-VL 72B) and adds a unified 3D-Resampler that compresses 6 video frames to 64 tokens (96× compression) for efficient video understanding — but it is heavier and targets iPads and laptops, not phones.
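The compression figures quoted above can be cross-checked with simple arithmetic. This sketch uses only the three numbers from the answer (6 frames, 64 tokens, 96×); the variable names are illustrative, and the per-frame figure is an implication of those numbers rather than something stated here.

```python
FRAMES = 6               # frames compressed together (from the text)
COMPRESSED_TOKENS = 64   # tokens after the 3D-Resampler (from the text)
COMPRESSION_RATIO = 96   # stated compression factor (from the text)

# Tokens the same frames would occupy before compression.
uncompressed_tokens = COMPRESSED_TOKENS * COMPRESSION_RATIO
# Implied per-frame token budget before compression.
tokens_per_frame = uncompressed_tokens // FRAMES

print(uncompressed_tokens, tokens_per_frame)  # prints "6144 1024"
```

In other words, the 96× figure corresponds to about 1,024 tokens per frame before compression and fewer than 11 tokens per frame after it.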

What about MiniCPM-o 4.5 — full omni-modal?

MiniCPM-o 4.5 is the 9B omni-modal sibling: it adds speech input/output and full-duplex live streaming on top of MiniCPM-V's vision capability. Think of it as MiniCPM-V 4.5 plus voice, comparable in capability scope to Gemini 2.5 Flash but running on an M4 iPad. It targets larger devices than the phones V 4.0 is tuned for.

What devices can run MiniCPM-V 4.0?

Pixel 8 and newer, iPhone 15 Pro and newer (including the iPhone 16 Pro Max, OpenBMB's published benchmark device, at 17.9 tokens/sec with under 2s time-to-first-token), and most Android phones with 4 GB+ RAM. The 4.1B parameter model at Q4 quantization needs about 2.5 GB of storage and 3-4 GB of RAM headroom for image processing. Mid-range phones may need more aggressive quantization.
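To turn the published throughput into a feel for end-to-end wait time, a rough estimate is time-to-first-token plus output length divided by decode speed. The sketch below is an assumption-laden illustration: the function name is hypothetical, 2.0 s is used as an upper bound for the "under 2s" figure, and a constant decode rate is assumed, which real devices may not sustain.

```python
def estimated_response_seconds(output_tokens: int,
                               tokens_per_sec: float = 17.9,
                               ttft_sec: float = 2.0) -> float:
    """Time to first token plus steady-state decode time, in seconds."""
    if tokens_per_sec <= 0:
        raise ValueError("tokens_per_sec must be positive")
    return ttft_sec + output_tokens / tokens_per_sec


# Defaults are the iPhone 16 Pro Max figures quoted above.
for n in (50, 200):
    print(n, round(estimated_response_seconds(n), 1))
```

With the benchmark-device numbers, a 50-token caption takes under 5 seconds and a 200-token answer roughly 13 seconds; slower phones can be modeled by passing a lower `tokens_per_sec`.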

What's the license like?

Custom ModelBest terms (modelbest-terms in our schema). The license permits commercial use and modifications but requires registration. This is more involved than Apache 2.0 (Gemma 4, Qwen 3.5, Mistral) or MIT (Phi), but still allows enterprise deployment with reasonable terms.

Citations