Here are the numbers. No preamble.
| Model | Parameters | Generation (tok/s) | Prompt Throughput (tok/s) |
|-------|------------|--------------------|---------------------------|
| deepseek-v2 | 16B (MoE) | 415.0 | 2,891.3 |
| qwen3.5:9b | 9B | 160.5 | 1,284.9 |
| gemma4:31b | 31B | 61.6 | 1,174.1 |
| mistral-small:24b | 24B | 87.3 | 14,568.2 |
| llama3.1:70b | 70B | 35.2 | 487.6 |
These are real production numbers from our dual RTX 5090 workstation. Not synthetic benchmarks, not cherry-picked runs, not theoretical maximums. They were measured during actual workloads, with real prompt lengths and concurrent requests.
If those numbers do not mean anything to you yet, here is the translation: 415 tokens per second means an AI model generates roughly 300 words per second. A full page of text in under two seconds. A chatbot response that appears to be instant. Batch processing thousands of customer reviews in minutes instead of hours.
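If you want to sanity-check numbers like these on your own hardware, Ollama's REST API reports token counts and timings with every response. Here is a minimal sketch, assuming Ollama is listening on its default port (11434), the model from the table is already pulled, and jq is installed:

```bash
#!/usr/bin/env bash
# Rough tokens-per-second check against a local Ollama instance.
# Assumes: Ollama on localhost:11434, model already pulled, jq installed.
MODEL="deepseek-v2"

RESPONSE=$(curl -s http://localhost:11434/api/generate -d "{
  \"model\": \"$MODEL\",
  \"prompt\": \"Write a 200-word product description for a mechanical keyboard.\",
  \"stream\": false
}")

# Ollama returns eval_count (tokens generated) and eval_duration (nanoseconds),
# plus the same pair for prompt processing.
echo "$RESPONSE" | jq -r '
  "generation: \(.eval_count / (.eval_duration / 1e9) | floor) tok/s",
  "prompt:     \(.prompt_eval_count / (.prompt_eval_duration / 1e9) | floor) tok/s"'
```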
## The Hardware
Let us be specific about what produced these numbers.
- CPU: AMD Ryzen 9 9950X (16 cores, 32 threads, 5.72GHz boost)
- RAM: 128GB DDR5 ECC (4x 32GB, 5600MHz)
- GPU 1: NVIDIA RTX 5090 32GB GDDR7
- GPU 2: NVIDIA RTX 5090 32GB GDDR7
- Storage: 4x 2TB NVMe Gen5 SSDs
- Cooling: Custom water loop, 3x 360mm radiators
- OS: Ubuntu 24.04 LTS
- Inference: Ollama with CUDA 12.8
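Before trusting any benchmark on a machine like this, confirm that both GPUs are visible to the driver and that Ollama is actually placing models on them. A quick check (standard commands; exact output varies by driver version):

```bash
# Confirm the driver sees both RTX 5090s and report VRAM and temperature.
nvidia-smi --query-gpu=index,name,memory.total,temperature.gpu --format=csv

# Show which models Ollama currently has loaded, and whether they sit on GPU or CPU.
ollama ps
```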
Total hardware cost: approximately $4,200 for the GPUs, $800 for CPU, $400 for RAM, $600 for storage, $400 for motherboard and case, $2,000 for water cooling. Call it $8,400 total.
But the comparison that matters is not the build cost. It is the operating cost versus API pricing.
## What Tokens Per Second Means for Your Business
Tokens per second is a technical metric. Let us translate it to business outcomes.
Chatbot response time. A typical chatbot response is 100 to 300 tokens. At 160 tok/s (qwen3.5:9b), that is a complete response in 0.6 to 1.9 seconds. At 415 tok/s (deepseek-v2), it is 0.24 to 0.72 seconds. The user perceives this as instant. No loading spinner. No typing indicator that lasts 5 seconds. Just an answer.
Batch document processing. Summarizing a 2,000 word document requires processing roughly 2,500 prompt tokens and generating 200 to 400 output tokens. At gemma4:31b speeds (1,174 tok/s prompt, 61.6 tok/s generation), processing one document takes about 2.1 seconds for prompt ingestion and 5 seconds for the summary. That is 7 seconds per document. For 1,000 documents: under 2 hours. On an API, that same batch costs $15 to $50 depending on the model and would still take hours due to rate limits.
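A batch job like that is a short shell loop. A minimal sketch, assuming plain-text documents in a docs/ directory; the directory layout and prompt wording are placeholders:

```bash
#!/usr/bin/env bash
# Summarize every document in docs/ with a local model, one summary per file.
# Assumes: Ollama installed, gemma4:31b pulled, documents are plain text.
mkdir -p summaries

for doc in docs/*.txt; do
  ollama run gemma4:31b \
    "Summarize the following document in 200-400 words:

$(cat "$doc")" > "summaries/$(basename "$doc")"
done
```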
Review response generation. Generating a personalized response to a customer review takes roughly 80 to 150 tokens of output. At 415 tok/s, you generate a response every 0.2 to 0.4 seconds. A business with 200 pending Google reviews can have personalized draft responses for every single one in under 80 seconds.
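The same pattern works for reviews through Ollama's HTTP API, which is easier to script safely when the input is arbitrary customer text. A sketch, assuming one review per line in a reviews.txt file and jq installed (both names are placeholders):

```bash
#!/usr/bin/env bash
# Draft a personalized reply for each review, one review per line.
# Assumes: Ollama on localhost:11434, deepseek-v2 pulled, jq installed.
while IFS= read -r review; do
  # jq -n builds properly escaped JSON, so quotes in reviews cannot break the request.
  curl -s http://localhost:11434/api/generate -d "$(jq -n \
    --arg prompt "Write a brief, polite business reply to this review: $review" \
    '{model: "deepseek-v2", prompt: $prompt, stream: false}')" \
    | jq -r '.response'
  echo "---"
done < reviews.txt
```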
Real time transcription post processing. Our Whisper instance handles speech to text. The output feeds directly into a language model for summarization, action item extraction, or formatting. At local inference speeds, the post processing adds negligible latency. The transcript is summarized before the user finishes closing the recording app.
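The whole pipeline amounts to a transcription command and a pipe. A sketch using the open source whisper CLI (flags match the reference openai-whisper tool; your transcription setup and the file names here are placeholders):

```bash
#!/usr/bin/env bash
# Transcribe a recording, then summarize it and extract action items locally.
whisper meeting.wav --model medium --output_format txt --output_dir .

ollama run qwen3.5:9b \
  "Summarize this meeting transcript and list all action items:

$(cat meeting.txt)"
```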
## The Optimization Journey That Got Us Here
The numbers at the top of this post did not happen on day one. Our first benchmarks after building the system were significantly worse. Here is the optimization path.
### Discovery 1: The CPU Governor Problem
This was the single biggest performance issue we found, and it had nothing to do with the GPUs.
Linux defaults to a "powersave" CPU frequency governor on many distributions. This governor scales the CPU down to its minimum frequency when load is low and scales up gradually under load. The problem is that Ollama's prompt processing is CPU intensive during the tokenization and KV cache population phases. The powersave governor was keeping our Ryzen 9 9950X at 3.3GHz instead of its 5.72GHz boost clock.
We discovered this by accident while running htop during a benchmark. The CPU frequency column showed 3.3GHz across all cores when we expected 5.7GHz.
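You do not need to catch it in htop. The governor and live clock for every core are readable straight from sysfs (paths are the standard Linux cpufreq interface):

```bash
# Which governor is in effect?
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor

# Current frequency (in kHz) for every core.
grep . /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq
```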
The fix: switch every core to the performance governor. A minimal version, assuming the standard cpufreq sysfs interface (`sudo cpupower frequency-set -g performance`, from the linux-tools package, accomplishes the same thing):
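```bash
# Set the performance governor on every core. This does not survive a reboot;
# persist it with your distribution's preferred mechanism.
echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

# Verify the change took effect.
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
```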