Gemma 4 launched — why we're not using it for clients (yet)
Google shipped a new generation of open models. Great benchmarks. We're still not switching over for small-business deployments. Here's the honest reason.
A few weeks ago, Google released Gemma 4. The benchmarks are good. The open-source community is excited. And we're still running Llama 3.2 for our small-business chatbots. Here's why that's not a mistake.
What Gemma 4 does well
- Strong performance on reasoning benchmarks
- Solid multilingual capability out of the box
- Efficient at quantized sizes
All true. All genuinely impressive.
What it doesn't change
For a small-business FAQ chatbot, the quality ceiling was effectively reached last year. Our plumbing demo answers customer questions correctly 95%+ of the time on a 1B-parameter model. Switching to a 7B Gemma 4 gets us to maybe 97%.
That 2% is invisible to the customer. The cost of getting it — higher RAM footprint, slower inference on CPU, mediocre Windows support in Ollama — is visible every day.
Why we wait, as a rule
Our policy on new models is boring on purpose:
1. Watch for 30 days after release. Most "groundbreaking" models develop ugly gotchas in the first month. Tokenizer bugs. License surprises. Benchmark-only sparkle.
2. Wait for Ollama quant releases. Until the community has reproducible q4_K_M quants, we're not shipping it to a client on a shared VPS.
3. Run it on OUR infra for a week. If we can't get it stable on shared hardware that every other project runs on, our clients can't either.
4. Measure on OUR real workloads, not benchmarks. Does it answer plumbing customer questions better? We don't care how it scores on MMLU.
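The last check — measuring on real workloads — is easy to automate. Here's a minimal sketch of what we mean; the question set, the keyword-match scoring, and the stand-in `fake_ask` function are all illustrative placeholders, not our actual harness. A real version would call your deployed chatbot and use questions pulled from real customer transcripts.

```python
# Minimal eval-harness sketch: score a model on real customer questions.
# The cases and keyword-match scoring below are illustrative placeholders --
# swap in your own transcripts and a real call to your deployed model.

def keyword_score(answer: str, required_keywords: list[str]) -> bool:
    """Pass if the answer mentions every required keyword (case-insensitive)."""
    answer_lower = answer.lower()
    return all(kw.lower() in answer_lower for kw in required_keywords)

def evaluate(ask, cases) -> float:
    """Run each (question, keywords) case through `ask`; return the pass rate."""
    passed = sum(keyword_score(ask(q), kws) for q, kws in cases)
    return passed / len(cases)

# Hypothetical workload: plumbing-customer FAQ questions.
CASES = [
    ("Do you handle emergency call-outs at night?", ["emergency"]),
    ("How much does a water heater install cost?", ["estimate"]),
    ("Are your plumbers licensed and insured?", ["licensed", "insured"]),
]

if __name__ == "__main__":
    # Stand-in for a real model call (e.g. an HTTP request to your chatbot).
    def fake_ask(question: str) -> str:
        canned = {
            "Do you handle emergency call-outs at night?":
                "Yes, we offer 24/7 emergency service.",
            "How much does a water heater install cost?":
                "We provide a free estimate after a quick inspection.",
            "Are your plumbers licensed and insured?":
                "All our plumbers are licensed and fully insured.",
        }
        return canned[question]

    print(f"pass rate: {evaluate(fake_ask, CASES):.0%}")
```

Run the same cases against the old model and the candidate. If the candidate doesn't beat the incumbent on your questions, the benchmark delta is irrelevant.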
Gemma 4 is in step 3 right now. If it finishes the gauntlet, we'll switch. If it doesn't, we won't.
What you should do
If you're running a stable small-business deployment on a model that works, don't upgrade just because a new one shipped. The "boring" answer is nearly always the right answer: measure, don't trend-chase.
If you're shopping for a chatbot vendor and they tell you they use "the latest model" — ask what they used last year. If the answer is something trendy that has since died, that tells you everything.
What we use, what it runs on
- Our live chatbot demo uses Llama 3.2, a year-old model that still holds up
- It runs on Ollama with OpenAI-compatible API
- Total inference cost to us per conversation: approximately zero
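"OpenAI-compatible API" means any OpenAI-style client can talk to the local model. Here's a standard-library sketch of what that looks like — the localhost URL and the `llama3.2` tag are what a default Ollama install exposes, but verify both against your own setup:

```python
# Sketch of calling an Ollama-served model through its OpenAI-compatible
# endpoint. The default local URL and model tag below are assumptions --
# check them against your own Ollama install (`ollama list`).
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/v1/chat/completions"  # Ollama's default port
MODEL = "llama3.2"  # whatever tag your install serves

def chat_request(question: str) -> urllib.request.Request:
    """Build an OpenAI-style chat-completions request for a local Ollama server."""
    payload = {
        "model": MODEL,
        "messages": [
            {"role": "system", "content": "You answer plumbing-customer FAQs."},
            {"role": "user", "content": question},
        ],
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

# To actually send it (requires a running Ollama server):
#   with urllib.request.urlopen(chat_request("Do you do emergency call-outs?")) as r:
#       print(json.loads(r.read())["choices"][0]["message"]["content"])
```

Because the request shape is the standard chat-completions format, swapping models later is a one-line config change, not a rewrite.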
Book a call if you want us to scope the right model for your actual workload — not the hype-cycle model.