dart-studio
← All writing
·1 min read·Alex Kargin

Why we default to Ollama for small-business AI

The same toolchain that data engineers are using to run models locally turns out to be the right default for small, long-lived deployments. Here's why.

ollamalocal-llmstack

When picking a stack for a long-lived small deployment, the useful constraint to apply is: will this survive the original engineer leaving, for five years, without external intervention?

That rules out a surprising amount of 2024-era tooling. API wrappers break when providers change pricing or sunset models. Custom inference servers need ops knowledge the eventual operator won't have. Managed platforms lock in and re-price.

What survives? Boring, open, locally runnable. Ollama fits.

What Ollama actually is

A small daemon that runs open-source models on any Linux/macOS/Windows box. It exposes an OpenAI-compatible HTTP API. Any code written against OpenAI switches with a one-line change:

OPENAI_BASE_URL=http://127.0.0.1:11434/v1

That's the entire migration. Same SDK, same request shape, same response shape.

Why this matters for small deployments

  1. Vendor pricing risk goes to zero. The model is a file on disk. It runs until electricity runs out.
  2. Data stays local. For anything touching PHI, legal records, or payment data, this stops objections before they start.
  3. Swapping models is trivial. Today it's llama3.2:1b. In six months when something better ships, ollama pull <new-model> and change one env var. No contract renegotiation, no approval cycles.

Where Ollama is the wrong answer

Three cases to reach for a cloud LLM API instead:

  • Frontier-quality work in a niche language. Small open models still wobble on nuanced Spanish or Russian; GPT-4o and Claude Sonnet don't.
  • Real-time voice pipelines. The whole round-trip (ASR + LLM + TTS) gets easier on managed infrastructure.
  • Client already has credits and doesn't care about the data-sovereignty question.

The operational argument

The best part of the stack is what it doesn't have. No Kubernetes. No auto-scaler. No observability mesh. One systemd unit, one daemon, one ollama pull. A junior developer can pick up the whole thing in an afternoon, which is the real sustainability test.

— Alex Kargin. More engineering writing at kargin-utkin.com.

Next step

Thirty-minute scoping call. No pitch, no retainer.

Tell us what eats your week. We'll tell you, honestly, whether AI can help — and if it can't, we'll say so.

Book a scoping callOr use the contact form →

Serving Broward & Palm Beach County, FL.