How On-Device AI Actually Works (No Jargon, Promise)

What is a 2-billion-parameter model doing on your phone? A plain-English walkthrough of how on-device AI actually works — and what it costs.

Why this article exists

If you have downloaded an app that promises “AI on your phone, no internet needed,” there is a reasonable chance you have wondered how that is even physically possible. Your phone has a battery, not a data center. Doesn’t ChatGPT take a building’s worth of GPUs to answer one question?

The short answer is: yes, the original models are massive — but the ones running on your phone are smaller siblings, and a few engineering tricks make them small enough to fit. Below is a plain-English explanation of how that actually works, written for someone who is curious about what is happening inside Cove Travel’s translation but does not want to spend an afternoon on Wikipedia.

What an “AI model” actually is

A modern language model is, at the boring engineering level, a giant grid of numbers. Billions of them. When you type “translate hello to Japanese,” the phone does an enormous amount of multiplication and addition with those numbers, and out the other end comes the word “こんにちは.”

That grid of numbers is what got produced when the model was trained — some people at Google fed it most of the readable internet, in many languages, and adjusted the numbers until the model could predict what word comes next in any sentence. That training step is what eats data centers. It is a one-time cost.

Once trained, the model is just the grid. You can copy it. You can ship it to a phone. The phone does not need a data center to use the grid — it only needs enough memory to hold it and enough math throughput to multiply through it once per word.
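
To make the “grid of numbers” concrete, here is a toy sketch in Python. The three-word vocabulary, the weights, and the helper names are all invented for illustration. A real model has billions of numbers arranged in many layers, but the core operation is the same multiply-and-add.

```python
# Toy "model": one row of weights per vocabulary word.
# Every value here is made up for illustration.
VOCAB = ["hello", "konnichiwa", "goodbye"]
WEIGHTS = [
    [0.1, 0.2, 0.0],  # row used to score "hello"
    [0.9, 0.1, 0.3],  # row used to score "konnichiwa"
    [0.0, 0.4, 0.1],  # row used to score "goodbye"
]


def next_token_scores(weights, features):
    """Multiply-and-add: one dot product per vocabulary word."""
    return [sum(w * x for w, x in zip(row, features)) for row in weights]


def pick_next_token(weights, features, vocab):
    """Return the vocabulary word with the highest score."""
    scores = next_token_scores(weights, features)
    best = max(range(len(scores)), key=scores.__getitem__)
    return vocab[best]


# A made-up numeric summary of the input so far.
features = [1.0, 0.0, 1.0]
print(pick_next_token(WEIGHTS, features, VOCAB))  # prints "konnichiwa"
```

Nothing in this sketch needs a data center. It only needs the weights in memory and some arithmetic, which is the point of the paragraph above.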

Why 2 billion parameters fits in your pocket

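The “2-billion-parameter” label describes the model’s effective size; the napkin math below assumes the full grid holds about 4 billion numbers. That sounds enormous, and it is, but each number is small (4 bytes at full precision, a byte or less once compressed), and modern phones have a surprising amount of RAM (8-12 GB on a Pixel 9 or recent iPhone).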

Here is the napkin math:

Item                            Size
Raw 4B model, no compression    16 GB
8-bit quantized                 4-5 GB
4-bit quantized                 2-3 GB
Your phone’s RAM                8-12 GB
Your phone’s storage            128-512 GB
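
The model rows in that table can be roughly reproduced with a few lines of Python. This is a back-of-the-envelope sketch: it assumes the uncompressed model stores each number in 32 bits (which is what 16 GB for 4 billion numbers implies), and the function name is just illustrative.

```python
def model_size_gb(num_params, bits_per_param):
    """Raw weight storage in gigabytes (1 GB = 10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


PARAMS = 4_000_000_000  # the 4 billion numbers from the table

for label, bits in [("32-bit (uncompressed)", 32), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label}: {model_size_gb(PARAMS, bits):.1f} GB")
# 32-bit (uncompressed): 16.0 GB
# 8-bit: 4.0 GB
# 4-bit: 2.0 GB
```

The table’s ranges sit a little above these raw figures. A plausible reason is that real model files carry some extra data alongside the compressed numbers, but that is an assumption, not something this article measures.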

Quantization is the trick. Instead of storing each of those 4 billion numbers at full precision, you round them to fewer bits, like storing a photo as a JPEG instead of a TIFF. The compressed model is a few percent worse than the uncompressed one but takes a quarter of the size at 8 bits, or roughly an eighth at 4 bits. For Cove Travel, that is the difference between fitting on your phone and not.
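
Here is a minimal sketch of what “rounding to fewer bits” can look like. It uses one simple scheme (a single scale factor shared by a group of numbers) and invented weights. Production quantizers are more elaborate, and this is not a description of how Cove Travel’s model was actually compressed.

```python
def quantize(values, bits=4):
    """Map floats to small signed integers plus one shared scale factor."""
    levels = 2 ** (bits - 1) - 1  # 7 for 4-bit, 127 for 8-bit
    scale = max(abs(v) for v in values) / levels
    if scale == 0:
        scale = 1.0  # all-zero input: any scale works
    return [round(v / scale) for v in values], scale


def dequantize(codes, scale):
    """Recover approximate floats from the stored integers."""
    return [c * scale for c in codes]


weights = [0.70, -0.32, 0.11, 0.02]       # made-up example weights
codes, scale = quantize(weights, bits=4)  # codes == [7, -3, 1, 0]
restored = dequantize(codes, scale)

for original, approx in zip(weights, restored):
    print(f"{original:+.2f} -> {approx:+.2f}")
```

Each restored value lands within half a step of the original. That small rounding error, repeated across billions of numbers, is where the “few percent worse” comes from.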

What the phone actually does when you ask it something

When you point Cove Travel at a Japanese menu, here is the rough sequence:

  1. The camera captures a frame and sends it to the model as image data.
  2. The model converts the image into a sequence of internal “tokens” — the model’s own way of representing chunks of meaning.
  3. The model walks through the grid of numbers, predicting the next token given everything it has seen so far. It repeats this once per output token: a dozen or so times for a short phrase, hundreds of times for a long passage.
  4. The tokens get converted back into text and shown on screen.
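
Steps 3 and 4 can be sketched as a loop. The lookup table below is a hypothetical stand-in for the model; a real model computes the next token from its weights rather than looking it up, and the menu phrase is invented.

```python
# Hypothetical stand-in for the model: maps the latest token to the next one.
TOY_NEXT = {
    "<start>": "grilled",
    "grilled": "salmon",
    "salmon": "set",
    "set": "<end>",
}


def predict_next(tokens):
    """Return the next token given everything seen so far."""
    return TOY_NEXT.get(tokens[-1], "<end>")


def generate(prompt_tokens, max_steps=50):
    """Generate one token per step until the model signals the end."""
    tokens = list(prompt_tokens)
    output = []
    for _ in range(max_steps):  # one pass through the model per step
        nxt = predict_next(tokens)
        if nxt == "<end>":
            break
        tokens.append(nxt)
        output.append(nxt)
    return " ".join(output)  # step 4: tokens back into text


print(generate(["<start>"]))  # prints "grilled salmon set"
```

The loop runs once per generated token, which is why longer translations take proportionally longer.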

Each of those “walk through the grid” steps takes a few tens of milliseconds on a recent phone. A short translation needs only a dozen or so steps, so it finishes in under 500 ms. A longer one takes a couple of seconds.
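
The arithmetic behind those timings is simple enough to check. The token counts and the 30 ms per-step figure below are illustrative assumptions, not measurements, and the sketch ignores the time spent reading the input.

```python
def estimated_latency_ms(output_tokens, ms_per_token):
    """Rough generation time: one model pass per output token."""
    return output_tokens * ms_per_token


# Assumed figures: 30 ms per step, 15 tokens for a short phrase,
# 80 tokens for a longer passage.
print(estimated_latency_ms(15, 30))  # prints 450
print(estimated_latency_ms(80, 30))  # prints 2400
```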

The thing that makes this fast enough to feel instant is a special chip, the NPU (Neural Processing Unit), that almost every flagship Android phone or iPhone has had since around 2019. The NPU is purpose-built for the kind of math that language models do. Running the same model on the regular CPU would be 5-10× slower and drain the battery much faster.

What the trade-offs really are

This is the part most marketing pages leave out. Smaller on-device models are genuinely worse than their cloud counterparts in three honest ways:

  • Less knowledge of obscure facts. A 4B model has read less than a cloud-scale 200B+ model. It will sometimes get rare place names, niche technical terms, or obscure historical references wrong. For travel, this rarely matters; for legal research, it would.
  • Shorter “context window.” The model can remember less of the conversation at once. Cloud models can hold 100,000+ tokens of context; a phone-friendly model usually holds 8,000 or so. For a translation app this is plenty; for “summarize my entire book” it is not.
  • Smaller “creative range.” When you ask a cloud model to brainstorm, the larger parameter count helps it generate more varied phrasings. A smaller model is more conservative.
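
The context-window limit in the second bullet can be pictured as a sliding window over the conversation. Dropping the oldest tokens, as below, is just one simple strategy assumed here for illustration; apps handle overflow in different ways, and the function name is hypothetical.

```python
def fit_to_context(tokens, max_tokens=8000):
    """Keep only the most recent max_tokens tokens."""
    if max_tokens <= 0:
        return []
    return tokens[-max_tokens:]


conversation = list(range(10))  # stand-in for 10 tokens of history
print(fit_to_context(conversation, max_tokens=4))  # prints [6, 7, 8, 9]
```

Anything that falls outside the window is invisible to the model, which is why a whole book does not fit but a menu or a short exchange does.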

The trade you make is: you give up a few percent of accuracy on the long tail of weird inputs, and in exchange you get latency under 500 ms, zero network dependency, and zero data leaving your phone. For a travel translator that runs in a Tokyo subway, that is the right trade. For drafting a legal contract, it is not.

Why “on-device” matters for privacy

Cloud AI works by sending your input to the cloud, running the model on a server, and sending the answer back. The server can log your input. Even companies with strict privacy policies typically retain enough metadata to reconstruct patterns. The privacy boundary is “we promise not to look.”

On-device AI works by running the model on your phone. Your input never leaves the device. There is nothing for a server to log because there is no server in the loop. The privacy boundary is the device boundary — which is the only one that actually holds.

This is also why “private cloud AI” is a contradiction. As long as your data has to traverse the network and be processed by someone else’s hardware, the trust requirement is “trust them.” On-device removes that.

What this looks like in Cove

Cove Travel ships Google Gemma 4 E2B — a specific 2-billion-parameter model from Google that was designed for on-device deployment. The first time you open the app, it downloads the model once (about 2.5 GB). After that:

  • Every translation runs on your phone’s NPU.
  • Every photo you point the camera at gets analyzed locally — never uploaded.
  • Every conversation in the two-way voice mode stays on the device.
  • Uninstalling the app deletes the entire model.

The same architecture extends across the Cove family — the upcoming Voice, Photo, and Health apps all share the same on-device approach. The model is one download; the apps are different ways of using it.

Where to read further

The two pieces this article references:

If you want the engineering depth, the official Gemma model card has the parameter counts, training-data details, and benchmark scores. The article above is the version for someone who wants to use the technology, not build it.