When to stop paying OpenAI — the break-even math
Cloud APIs win at volume zero. Local models win past a certain point. Here's the specific number where the lines cross for typical workloads.
The economic question behind "should I use a cloud LLM API or run my own?" has a surprisingly specific answer. It depends on one number: calls per month.
Baseline assumption
Typical chatbot reply: ~5,000 tokens in (system prompt + context + user message), ~300 tokens out. Numbers below assume that shape; longer context or longer responses raise the per-reply cloud cost and pull the break-even down.
Cloud pricing (current generation)
- OpenAI gpt-4o-mini: $0.15 / 1M input + $0.60 / 1M output. Per reply: ~$0.0009. 1,000 replies ≈ $0.90.
- Anthropic Haiku: comparable, maybe 10–20% different in either direction.
At low volume the cloud bill is rounding-error small. Don't overthink the architecture.
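The per-reply figure is plain arithmetic, and a minimal Python sketch makes it easy to re-run with your own numbers. The prices and token counts are the ones quoted above; the function and constant names are illustrative, not part of any SDK.

```python
# Prices as quoted in this article; check the provider's pricing page before relying on them.
INPUT_PRICE_PER_M = 0.15   # USD per 1M input tokens (gpt-4o-mini)
OUTPUT_PRICE_PER_M = 0.60  # USD per 1M output tokens (gpt-4o-mini)


def cost_per_reply(tokens_in: int, tokens_out: int) -> float:
    """Return the cloud cost in USD of one reply with the given token counts."""
    total = tokens_in * INPUT_PRICE_PER_M + tokens_out * OUTPUT_PRICE_PER_M
    return total / 1_000_000


# Baseline shape from this article: ~5,000 tokens in, ~300 tokens out.
per_reply = cost_per_reply(5_000, 300)
print(f"per reply: ${per_reply:.5f}")
print(f"per 1,000 replies: ${per_reply * 1_000:.2f}")
```

With the baseline shape this comes to about $0.00093 per reply, which rounds to the ~$0.0009 used above. Input tokens account for most of it, so trimming the system prompt or context is the cheapest lever.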
Local model cost
- Hardware: a $20/month VPS handles a small model
- Electricity: negligible for a 1B-parameter model on modern CPU
- Human time: ~1 hour/month babysitting
All-in: roughly $30/month flat, regardless of traffic.
Where the lines cross
At current token pricing, you'd need roughly 30,000 chatbot replies a month ($30 flat divided by ~$0.0009 per reply) before gpt-4o-mini becomes more expensive than a modest self-hosted alternative. That's ~1,000 replies per day. Almost nobody hits it.
So for pure dollars, cloud wins for 99% of small deployments on paper.
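The crossing point can be sketched the same way: divide the flat monthly local cost by the cloud cost per reply. This is a rough model under the article's assumptions ($30/month flat, gpt-4o-mini prices as quoted); the function name and the alternative 10,000-token shape are hypothetical, added only to show how a longer context moves the number.

```python
import math


def break_even_replies(
    flat_monthly_usd: float,
    tokens_in: int,
    tokens_out: int,
    in_price_per_m: float = 0.15,   # USD per 1M input tokens, as quoted in this article
    out_price_per_m: float = 0.60,  # USD per 1M output tokens, as quoted in this article
) -> int:
    """Replies per month at which the cloud bill reaches the flat local cost."""
    per_reply = (tokens_in * in_price_per_m + tokens_out * out_price_per_m) / 1_000_000
    return math.ceil(flat_monthly_usd / per_reply)


# Baseline: $30/month local cost, ~5,000 tokens in, ~300 tokens out.
baseline = break_even_replies(30, 5_000, 300)
print(f"baseline break-even: {baseline:,} replies/month (~{baseline // 30:,}/day)")

# Hypothetical heavier shape: twice the context, same output length.
heavy = break_even_replies(30, 10_000, 300)
print(f"10k-token context break-even: {heavy:,} replies/month")
```

The baseline lands a little above 32,000 replies a month, in line with the "roughly 30,000" figure. Doubling the context cuts the break-even by nearly half, so the result is only as good as your estimate of tokens per reply.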
Why the paper math isn't the whole math
Three reasons the decision doesn't always follow the cheaper-per-request math:
- Privacy. One PHI leak through a third-party API triggers a HIPAA conversation nobody wants. Local inference removes the conversation.
- Reliability. The cloud provider has an outage on the busiest day of your year, at 2am, and the person on-call is you. Local inference on your own box doesn't depend on someone else's uptime.
- Pricing risk. Cloud APIs can raise prices on as little as 30 days' notice. Your local model costs what it cost a year ago.
Rule of thumb
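- <500 calls/month: cloud. Save yourself the ops.
- 500–30,000/month: cloud is still cheaper on paper; go local if privacy or reliability matters more than the bill.
- >30,000/month: local is cheaper under the assumptions above, and you no longer depend on a provider's uptime.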
As with most infrastructure questions, the right answer is "measure first, decide second."
— Alex Kargin. More engineering writing at kargin-utkin.com.