Running AI on a $20/month VPS — what we actually measured
Can a cheap rented server really host a working AI chatbot? We benchmarked three small models on real small-business questions. Here's what held up.
Every week someone asks: "do I need a GPU, or can I run this on a cheap box?" For a focused FAQ-style workload, the answer is surprisingly often the cheap box. Here are the numbers.
Setup
- Server: 2-vCPU / 4GB RAM VPS (~$20/month, standard on any cloud host)
- Software: Ollama, Apache as reverse proxy, a small Next.js frontend
- Models tested: llama3.2:1b, qwen2.5:0.5b, phi3:mini
- Workload: 7-question plumbing FAQ fed through a chat widget
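For anyone who wants to reproduce the two timing columns, here is a minimal sketch of how first-token and full-reply time can be read off an Ollama streaming response. It is not the exact harness behind the table. It assumes Ollama is listening on its default local port and uses the `/api/generate` endpoint with `"stream": true`, which returns one JSON object per line; the helper names `ollama_stream` and `time_reply` are made up for this example.

```python
import json
import time
import urllib.request
from typing import Callable, Iterable, Iterator, Tuple

# Assumption: Ollama running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def ollama_stream(model: str, prompt: str, url: str = OLLAMA_URL) -> Iterator[bytes]:
    """Yield raw NDJSON lines from a streaming /api/generate call."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": True}).encode()
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        for line in resp:
            yield line


def time_reply(
    lines: Iterable[bytes],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[float, float, str]:
    """Return (first_token_seconds, full_reply_seconds, text) for one reply."""
    start = clock()  # the generator is lazy, so the request starts after this
    first = None
    parts = []
    for raw in lines:
        if not raw.strip():
            continue
        chunk = json.loads(raw)
        piece = chunk.get("response", "")
        if piece and first is None:
            first = clock() - start  # first non-empty piece of text
        parts.append(piece)
        if chunk.get("done"):
            break
    total = clock() - start
    return (first if first is not None else total, total, "".join(parts))
```

A call would look like `time_reply(ollama_stream("llama3.2:1b", "How do I shut off my water main?"))`, where the question is a placeholder rather than one of the seven from the actual FAQ. Taking the clock as a parameter keeps the timing logic testable without a live server.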
Results
| Model | Disk | First-token | Full reply | Answer quality (1–5) |
|---|---|---|---|---|
| llama3.2:1b | 1.3 GB | 2.1s | 9s | 4 |
| qwen2.5:0.5b | 400 MB | 1.1s | 4s | 3 |
| phi3:mini | 2.2 GB | 3.0s | 14s | 4 |
Numbers drift by 20–40% depending on server load. If the VPS is sharing CPU with a database under pressure, expect the high end.
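To make that drift concrete, here is a small sketch that applies the 20–40% band to the table. It assumes the table values are the quiet-server baseline and that "drift" means things get slower under load, which is one reading of the note above; `loaded_range` is a name invented for this example.

```python
from typing import Dict, Tuple

# (first-token seconds, full-reply seconds) copied from the results table.
MEASURED: Dict[str, Tuple[float, float]] = {
    "llama3.2:1b": (2.1, 9.0),
    "qwen2.5:0.5b": (1.1, 4.0),
    "phi3:mini": (3.0, 14.0),
}


def loaded_range(seconds: float, low: float = 0.20, high: float = 0.40) -> Tuple[float, float]:
    """Return the (low, high) slowdown band for one measured time.

    Assumption: load only makes things slower, by 20% to 40%.
    """
    return (round(seconds * (1 + low), 2), round(seconds * (1 + high), 2))


if __name__ == "__main__":
    for model, (first_token, full_reply) in MEASURED.items():
        print(model, "first token:", loaded_range(first_token),
              "full reply:", loaded_range(full_reply))
```

Under that assumption, llama3.2:1b's 2.1s first token becomes roughly 2.5 to 2.9 seconds on a busy box, which is right at the edge of the "empty chat bubble" problem.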
What actually matters
For a chatbot UX, first-token latency is the number that makes or breaks the feel. Nobody stares at an empty chat bubble for 3 seconds thinking "this is working."
Streaming fixes the perception: with llama3.2:1b the total time to finish stays at 9 seconds, but the first word appears after about 2, and the reader is already reading while the rest generates. Perceived latency drops from about 9 seconds to about 2.
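The relay side of streaming is small. Below is a rough sketch, assuming the backend receives Ollama's line-delimited JSON stream (one object per line, with a `response` field and a final `done: true`) and pushes each piece of text onward the moment it arrives. The function names `stream_tokens` and `render` are hypothetical, and a real chat widget would write to an HTTP response rather than a terminal.

```python
import json
import sys
from typing import Iterable, Iterator, TextIO


def stream_tokens(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield each piece of reply text as soon as its JSON line is parsed."""
    for raw in lines:
        if not raw.strip():
            continue  # skip blank keep-alive lines
        chunk = json.loads(raw)
        if chunk.get("response"):
            yield chunk["response"]
        if chunk.get("done"):
            return


def render(lines: Iterable[bytes], out: TextIO = sys.stdout) -> str:
    """Write pieces to `out` immediately and return the full reply text."""
    shown = []
    for piece in stream_tokens(lines):
        out.write(piece)
        out.flush()  # flush per piece so the reader sees text right away
        shown.append(piece)
    return "".join(shown)
```

The flush after every piece is the whole trick: anything that buffers the full reply before sending it, whether in the app or in a proxy in front of it, puts the wait back at the full-reply number.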
When the $20 VPS stops being enough
Three situations force a bigger stack:
- Multilingual nuance. Sub-billion-parameter models wobble on idiomatic Spanish, Russian, or Mandarin in ways that matter.
- Real-time voice. A different problem entirely: streaming ASR, an LLM, and TTS chained together under sub-second latency targets.
- RAG over client data. Once the bot is reading a CRM, security and retrieval architecture get more careful.
For a pure knowledge-base FAQ bot on a small website? CPU on a $20 box, with streaming, is genuinely enough.
— Alex Kargin. More engineering writing at kargin-utkin.com.