Here are the numbers. No preamble.
| Model | Parameters | Generation (tok/s) | Prompt Throughput (tok/s) |
|-------|------------|--------------------|---------------------------|
| deepseek-v2 | 16B (MoE) | 415.0 | 2,891.3 |
| qwen3.5:9b | 9B | 160.5 | 1,284.9 |
| gemma4:31b | 31B | 61.6 | 1,174.1 |
| mistral-small:24b | 24B | 87.3 | 14,568.2 |
| llama3.1:70b | 70B | 35.2 | 487.6 |
These are real production numbers from our dual RTX 5090 workstation. Not synthetic benchmarks, not cherry-picked runs, not theoretical maximums. They were measured during actual workloads, with real prompt lengths and concurrent requests.
If those numbers do not mean anything to you yet, here is the translation: 415 tokens per second means an AI model generates roughly 300 words per second. A full page of text in under two seconds. A chatbot response that appears to be instant. Batch processing thousands of customer reviews in minutes instead of hours.
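If you want to sanity-check numbers like these on your own hardware, Ollama's REST API reports token counts and timings with every response. Here is a minimal sketch, assuming Ollama is listening on its default port (11434), the model from the table is already pulled, and jq is installed:

```bash
#!/usr/bin/env bash
# Rough tokens-per-second check against a local Ollama instance.
# Assumes: Ollama on localhost:11434, model already pulled, jq installed.
MODEL="deepseek-v2"

RESPONSE=$(curl -s http://localhost:11434/api/generate -d "{
  \"model\": \"$MODEL\",
  \"prompt\": \"Write a 200-word product description for a mechanical keyboard.\",
  \"stream\": false
}")

# Ollama returns eval_count (tokens generated) and eval_duration (nanoseconds),
# plus the same pair for prompt processing.
echo "$RESPONSE" | jq -r '
  "generation: \(.eval_count / (.eval_duration / 1e9) | floor) tok/s",
  "prompt:     \(.prompt_eval_count / (.prompt_eval_duration / 1e9) | floor) tok/s"'
```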
## The Hardware
Let us be specific about what produced these numbers.
- CPU: AMD Ryzen 9 9950X (16 cores, 32 threads, 5.72GHz boost)
- RAM: 128GB DDR5 ECC (4x 32GB, 5600MHz)
- GPU 1: NVIDIA RTX 5090 32GB GDDR7
- GPU 2: NVIDIA RTX 5090 32GB GDDR7
- Storage: 4x 2TB NVMe Gen5 SSDs
- Cooling: Custom water loop, 3x 360mm radiators
- OS: Ubuntu 24.04 LTS
- Inference: Ollama with CUDA 12.8
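Before trusting any benchmark on a machine like this, confirm that both GPUs are visible to the driver and that Ollama is actually placing models on them. A quick check (standard commands; exact output varies by driver version):

```bash
# Confirm the driver sees both RTX 5090s and report VRAM and temperature.
nvidia-smi --query-gpu=index,name,memory.total,temperature.gpu --format=csv

# Show which models Ollama currently has loaded, and whether they sit on GPU or CPU.
ollama ps
```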
Total hardware cost: approximately $4,200 for the GPUs, $800 for CPU, $400 for RAM, $600 for storage, $400 for motherboard and case, $2,000 for water cooling. Call it $8,400 total.
But the comparison that matters is not the build cost. It is the operating cost versus API pricing.
## What Tokens Per Second Means for Your Business
Tokens per second is a technical metric. Let us translate it to business outcomes.
Chatbot response time. A typical chatbot response is 100 to 300 tokens. At 160 tok/s (qwen3.5:9b), that is a complete response in 0.6 to 1.9 seconds. At 415 tok/s (deepseek-v2), it is 0.24 to 0.72 seconds. The user perceives this as instant. No loading spinner. No typing indicator that lasts 5 seconds. Just an answer.
Batch document processing. Summarizing a 2,000 word document requires processing roughly 2,500 prompt tokens and generating 200 to 400 output tokens. At gemma4:31b speeds (1,174 tok/s prompt, 61.6 tok/s generation), processing one document takes about 2.1 seconds for prompt ingestion and 5 seconds for the summary. That is 7 seconds per document. For 1,000 documents: under 2 hours. On an API, that same batch costs $15 to $50 depending on the model and would still take hours due to rate limits.
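A batch job like that is a short shell loop. A minimal sketch, assuming plain-text documents in a docs/ directory; the directory layout and prompt wording are placeholders:

```bash
#!/usr/bin/env bash
# Summarize every document in docs/ with a local model, one summary per file.
# Assumes: Ollama installed, gemma4:31b pulled, documents are plain text.
mkdir -p summaries

for doc in docs/*.txt; do
  ollama run gemma4:31b \
    "Summarize the following document in 200-400 words:

$(cat "$doc")" > "summaries/$(basename "$doc")"
done
```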
Review response generation. Generating a personalized response to a customer review takes roughly 80 to 150 tokens of output. At 415 tok/s, you generate a response every 0.2 to 0.4 seconds. A business with 200 pending Google reviews can have personalized draft responses for every single one in under 80 seconds.
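The same pattern works for reviews through Ollama's HTTP API, which is easier to script safely when the input is arbitrary customer text. A sketch, assuming one review per line in a reviews.txt file and jq installed (both names are placeholders):

```bash
#!/usr/bin/env bash
# Draft a personalized reply for each review, one review per line.
# Assumes: Ollama on localhost:11434, deepseek-v2 pulled, jq installed.
while IFS= read -r review; do
  # jq -n builds properly escaped JSON, so quotes in reviews cannot break the request.
  curl -s http://localhost:11434/api/generate -d "$(jq -n \
    --arg prompt "Write a brief, polite business reply to this review: $review" \
    '{model: "deepseek-v2", prompt: $prompt, stream: false}')" \
    | jq -r '.response'
  echo "---"
done < reviews.txt
```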
Real time transcription post processing. Our Whisper instance handles speech to text. The output feeds directly into a language model for summarization, action item extraction, or formatting. At local inference speeds, the post processing adds negligible latency. The transcript is summarized before the user finishes closing the recording app.
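The whole pipeline amounts to a transcription command and a pipe. A sketch using the open source whisper CLI (flags match the reference openai-whisper tool; your transcription setup and the file names here are placeholders):

```bash
#!/usr/bin/env bash
# Transcribe a recording, then summarize it and extract action items locally.
whisper meeting.wav --model medium --output_format txt --output_dir .

ollama run qwen3.5:9b \
  "Summarize this meeting transcript and list all action items:

$(cat meeting.txt)"
```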
## The Optimization Journey That Got Us Here
The numbers at the top of this post did not happen on day one. Our first benchmarks after building the system were significantly worse. Here is the optimization path.
### Discovery 1: The CPU Governor Problem
This was the single biggest performance issue we found, and it had nothing to do with the GPUs.
Linux defaults to a "powersave" CPU frequency governor on many distributions. This governor scales the CPU down to its minimum frequency when load is low and scales up gradually under load. The problem is that Ollama's prompt processing is CPU intensive during the tokenization and KV cache population phases. The powersave governor was keeping our Ryzen 9 9950X at 3.3GHz instead of its 5.72GHz boost clock.
We discovered this by accident while running htop during a benchmark. The CPU frequency column showed 3.3GHz across all cores when we expected 5.7GHz.
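You do not need to catch it in htop. The governor and live clock for every core are readable straight from sysfs (paths are the standard Linux cpufreq interface):

```bash
# Which governor is in effect?
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor

# Current frequency (in kHz) for every core.
grep . /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq
```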
The fix: switch every core to the performance governor. A minimal version, assuming the standard cpufreq sysfs interface (`sudo cpupower frequency-set -g performance`, from the linux-tools package, accomplishes the same thing):
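```bash
# Set the performance governor on every core. This does not survive a reboot;
# persist it with your distribution's preferred mechanism.
echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

# Verify the change took effect.
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
```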