Gemma 4 launched — why we're not using it for clients (yet)
Google shipped a new generation of open models. Benchmarks are good. Field notes on why a production switch isn't automatic.
Gemma 4 shipped in early 2026. Benchmarks are strong. Open-source community is excited. In our side-by-side tests against a stable Llama 3.2 deployment, the switch is not obviously a win. Here's the reasoning.
What Gemma 4 does well
- Strong reasoning-benchmark performance relative to size
- Solid multilingual out of the box
- Efficient at quantized sizes (q4_K_M weights are reasonable)
All true. All genuinely impressive.
What it doesn't change for a running small deployment
For a focused FAQ-style chatbot, the ceiling on quality was already met with last year's small models. A stable 1B-parameter deployment answers visitor questions correctly 95%+ of the time. Upgrading to 7B Gemma 4 pushes that to maybe 97%.
The 2% is below the noise floor for most users. The cost of getting it — higher RAM footprint, slower inference on CPU, newer and less-battle-tested quants — is visible every day.
The policy we follow
Boring on purpose:
- Watch for 30 days after release. Early "groundbreaking" models routinely develop ugly gotchas in the first month: tokenizer bugs, license surprises, benchmark-only sparkle.
- Wait for reproducible Ollama quants. Until the community has q4_K_M quants that run identically across hosts, we're not shipping it to production on shared hardware.
- Run a week on our own infrastructure. If it can't stay stable on the same shared box every other workload runs on, it won't be stable for a client either.
- Measure on real workloads, not leaderboards. Does it answer the plumbing-customer FAQ better than what's already deployed? That's the number that matters, not MMLU.
Gemma 4 is at step 3 as of this writing. If it completes the gauntlet, we switch. If not, we don't.
What to do if you're currently running something stable
Don't upgrade just because a new model shipped. The "boring" answer is nearly always correct: measure, compare on your actual workload, switch only if the win is visible to users.
If evaluating vendors for new work and they tell you they always use "the latest model" — ask what they used a year ago. If the answer is a model that was trendy and didn't age well, there's your signal.
— Alex Kargin. More engineering writing at kargin-utkin.com.