Gemma 4 launched — why we're not using it for clients (yet)
Google shipped a new generation of open models. Great benchmarks. We're still not switching over for small-business deployments. Here's the honest reason.
A few weeks ago, Google released Gemma 4. The benchmarks are good. The open-source community is excited. And we're still running Llama 3.2 for our small-business chatbots. Here's why that's not a mistake.
What Gemma 4 does well
- Strong performance on reasoning benchmarks
- Solid multilingual capability out of the box
- Efficient at quantized sizes
All true. All genuinely impressive.
What it doesn't change
For a small-business FAQ chatbot, the quality ceiling was effectively reached last year. Our plumbing demo answers customer questions correctly 95%+ of the time on a 1B-parameter model. Switching to a 7B Gemma 4 gets us to maybe 97%.
That 2% is invisible to the customer. The cost of getting it — higher RAM footprint, slower inference on CPU, mediocre Windows support in Ollama — is visible every day.
Why we wait, as a rule
Our policy on new models is boring on purpose:
1. Watch for 30 days after release. Most "groundbreaking" models develop ugly gotchas in the first month. Tokenizer bugs. License surprises. Benchmark-only sparkle.
2. Wait for Ollama quant releases. Until the community has reproducible q4_K_M quants, we're not shipping it to a client on a shared VPS.
3. Run it on OUR infra for a week. If we can't get it stable on shared hardware that every other project runs on, our clients can't either.
4. Measure on OUR real workloads, not benchmarks. Does it answer plumbing customer questions better? We don't care how it scores on MMLU.
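The last check — measuring on real workloads — is easy to automate. Here's a minimal sketch of what we mean; the question set, the keyword-match scoring, and the stand-in `fake_ask` function are all illustrative placeholders, not our actual harness. A real version would call your deployed chatbot and use questions pulled from real customer transcripts.

```python
# Minimal eval-harness sketch: score a model on real customer questions.
# The cases and keyword-match scoring below are illustrative placeholders --
# swap in your own transcripts and a real call to your deployed model.

def keyword_score(answer: str, required_keywords: list[str]) -> bool:
    """Pass if the answer mentions every required keyword (case-insensitive)."""
    answer_lower = answer.lower()
    return all(kw.lower() in answer_lower for kw in required_keywords)

def evaluate(ask, cases) -> float:
    """Run each (question, keywords) case through `ask`; return the pass rate."""
    passed = sum(keyword_score(ask(q), kws) for q, kws in cases)
    return passed / len(cases)

# Hypothetical workload: plumbing-customer FAQ questions.
CASES = [
    ("Do you handle emergency call-outs at night?", ["emergency"]),
    ("How much does a water heater install cost?", ["estimate"]),
    ("Are your plumbers licensed and insured?", ["licensed", "insured"]),
]

if __name__ == "__main__":
    # Stand-in for a real model call (e.g. an HTTP request to your chatbot).
    def fake_ask(question: str) -> str:
        canned = {
            "Do you handle emergency call-outs at night?":
                "Yes, we offer 24/7 emergency service.",
            "How much does a water heater install cost?":
                "We provide a free estimate after a quick inspection.",
            "Are your plumbers licensed and insured?":
                "All our plumbers are licensed and fully insured.",
        }
        return canned[question]

    print(f"pass rate: {evaluate(fake_ask, CASES):.0%}")
```

Run the same cases against the old model and the candidate. If the candidate doesn't beat the incumbent on your questions, the benchmark delta is irrelevant.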
Gemma 4 is in step 3 right now. If it finishes the gauntlet, we'll switch. If it doesn't, we won't.
What you should do
If you're running a stable small-business deployment on a model that works, don't upgrade just because a new one shipped. The "boring" answer is nearly always the right answer: measure, don't trend-chase.
If you're shopping for a chatbot vendor and they tell you they use "the latest model" — ask what they used last year. If the answer is something trendy that has since died, that tells you everything.
What we use, what it runs on
- Our live chatbot demo uses Llama 3.2, a year-old model that still holds up
- It runs on Ollama with OpenAI-compatible API
- Total inference cost to us per conversation: approximately zero
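"OpenAI-compatible API" means any OpenAI-style client can talk to the local model. Here's a standard-library sketch of what that looks like — the localhost URL and the `llama3.2` tag are what a default Ollama install exposes, but verify both against your own setup:

```python
# Sketch of calling an Ollama-served model through its OpenAI-compatible
# endpoint. The default local URL and model tag below are assumptions --
# check them against your own Ollama install (`ollama list`).
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/v1/chat/completions"  # Ollama's default port
MODEL = "llama3.2"  # whatever tag your install serves

def chat_request(question: str) -> urllib.request.Request:
    """Build an OpenAI-style chat-completions request for a local Ollama server."""
    payload = {
        "model": MODEL,
        "messages": [
            {"role": "system", "content": "You answer plumbing-customer FAQs."},
            {"role": "user", "content": question},
        ],
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

# To actually send it (requires a running Ollama server):
#   with urllib.request.urlopen(chat_request("Do you do emergency call-outs?")) as r:
#       print(json.loads(r.read())["choices"][0]["message"]["content"])
```

Because the request shape is the standard chat-completions format, swapping models later is a one-line config change, not a rewrite.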
Book a call if you want us to scope the right model for your actual workload — not the hype-cycle model.