dart-studio
2 min read · Alex Kargin

Gemma 4 launched — why we're not using it for clients (yet)

Google shipped a new generation of open models. Benchmarks are good. Field notes on why a production switch isn't automatic.

gemma · model-selection · patience

Gemma 4 shipped in early 2026. Benchmarks are strong. The open-source community is excited. But in our side-by-side tests against a stable Llama 3.2 deployment, the switch is not obviously a win. Here's the reasoning.

What Gemma 4 does well

  • Strong reasoning-benchmark performance relative to size
  • Solid multilingual support out of the box
  • Efficient when quantized (q4_K_M weights are a reasonable size)

All true. All genuinely impressive.

What it doesn't change for a running small deployment

For a focused FAQ-style chatbot, last year's small models already reached the quality ceiling. A stable 1B-parameter deployment answers visitor questions correctly 95%+ of the time. Upgrading to 7B Gemma 4 pushes that to maybe 97%.

Those 2 percentage points are below the noise floor for most users. The cost of getting them (higher RAM footprint, slower inference on CPU, newer and less battle-tested quants) is visible every day.
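
One rough way to put a number on "noise floor" is a two-proportion z-test on the evaluation results. This is a minimal sketch, not our production tooling: the eval-set size of 200 questions is an assumption for illustration, and the function name is hypothetical.

```python
import math


def two_proportion_z(correct_a: int, correct_b: int, n: int) -> float:
    """z statistic for the accuracy difference of two models,
    each scored on n questions (pooled standard error)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_a, p_b = correct_a / n, correct_b / n
    pooled = (correct_a + correct_b) / (2 * n)
    se = math.sqrt(pooled * (1 - pooled) * (2 / n))
    if se == 0:
        return 0.0  # both models scored 0% or 100%: no measurable difference
    return (p_b - p_a) / se


# Assumed eval set: 200 questions, 95% correct vs 97% correct.
z = two_proportion_z(correct_a=190, correct_b=194, n=200)
print(f"z = {z:.2f}")  # z = 1.02, well under the usual 1.96 threshold
```

Under these assumptions the gap does not clear the usual 95% significance threshold. Because both models would answer the same questions, a paired test such as McNemar's is the stricter choice; the unpaired version is shown only because it is shorter.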

The policy we follow

Boring on purpose:

  1. Watch for 30 days after release. Early "groundbreaking" models routinely develop ugly gotchas in the first month: tokenizer bugs, license surprises, benchmark-only sparkle.
  2. Wait for reproducible Ollama quants. Until the community has q4_K_M quants that run identically across hosts, we're not shipping it to production on shared hardware.
  3. Run a week on our own infrastructure. If it can't stay stable on the same shared box every other workload runs on, it won't be stable for a client either.
  4. Measure on real workloads, not leaderboards. Does it answer the plumbing-customer FAQ better than what's already deployed? That's the number that matters, not MMLU.
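
Step 4 can be sketched as a small harness that scores any model on the same set of real questions. Everything here is illustrative: the `accuracy` helper, the keyword-match scoring rule, the sample cases, and the two stub "models" are assumptions standing in for real inference calls.

```python
from typing import Callable, Sequence

# Each case: (visitor question, keywords a correct answer must mention).
Case = tuple[str, Sequence[str]]


def accuracy(answer: Callable[[str], str], cases: Sequence[Case]) -> float:
    """Fraction of cases where the reply contains every required keyword."""
    if not cases:
        raise ValueError("need at least one case")
    hits = 0
    for question, keywords in cases:
        reply = answer(question).lower()
        if all(keyword.lower() in reply for keyword in keywords):
            hits += 1
    return hits / len(cases)


# Hypothetical FAQ cases; a real set would come from actual visitor logs.
CASES: list[Case] = [
    ("Do you handle emergency leaks?", ["emergency"]),
    ("What are your hours?", ["8am", "5pm"]),
]


# Stubs standing in for the deployed and the candidate model.
def deployed_model(question: str) -> str:
    return "Yes, we take emergency calls. Hours are 8am to 5pm."


def candidate_model(question: str) -> str:
    return "We are open 8am to 5pm."


print(accuracy(deployed_model, CASES))   # 1.0
print(accuracy(candidate_model, CASES))  # 0.5
```

Passing the model in as a plain callable keeps the harness independent of how inference is served, so the same cases can be replayed against whatever is deployed today and whatever is being evaluated.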

Gemma 4 is at step 3 as of this writing. If it completes the gauntlet, we switch. If not, we don't.
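
A cheap first check for step 2 is confirming that every host holds byte-identical weights before comparing behaviour. A minimal sketch, assuming the quantized weight files are reachable as local paths (the helper names are hypothetical):

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """SHA-256 of a file, read in chunks so large weights fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_weights(paths: list[Path]) -> bool:
    """True when all given files share one SHA-256 digest."""
    return len({sha256_of(p) for p in paths}) == 1
```

Matching digests only prove the files are identical. Identical behaviour across hosts still has to be confirmed by running the same prompts on each one.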

What to do if you're currently running something stable

Don't upgrade just because a new model shipped. The "boring" answer is nearly always correct: measure, compare on your actual workload, switch only if the win is visible to users.

If you're evaluating vendors for new work and they tell you they always use "the latest model", ask what they used a year ago. If the answer is a model that was trendy and didn't age well, there's your signal.

— Alex Kargin. More engineering writing at kargin-utkin.com.

Serving Broward & Palm Beach County, FL.