We benchmarked six models for a client last quarter.
The leaderboard winner was their worst performer in production.
Here is why leaderboard scores lie — and how to build evals that don’t.
The leaderboard trap
Standard benchmarks (MMLU, HumanEval, GSM8K) are useful for one thing: a rough signal of general capability.
They are useless for production decisions because:
- They test academic tasks, not your tasks. A model that crushes MMLU may hallucinate on your specific domain.
- They measure average, not variance. A model with a great mean and terrible tail is a production incident waiting to happen.
- They do not test safety characteristics. Jailbreak resistance, prompt injection handling, and data leakage are invisible on leaderboards.
- They are gamed. Training on test data is common enough to be assumed.
We use leaderboards for initial filtering. Never for final selection.
What a real eval looks like
For every client engagement, we build a custom eval suite with three layers:
Task-specific accuracy
- Sample from actual production inputs
- Measure exact-match, semantic similarity, or human rating depending on the task
- Include edge cases and adversarial examples from real user behavior
Regression and safety
- Compare against baseline models on held-out prompts
- Red-team for jailbreaks, prompt injection, and data leakage
- Measure refusal robustness and toxicity on domain-relevant inputs
Operational metrics
- Latency at P50, P95, P99
- Token cost per request
- Failure rate and retry patterns
- Memory usage under load
A model that is 5% less accurate but 3x cheaper and 10x more robust wins.
The open-weight advantage in evals
With open-weight models (Llama, Mistral, Qwen), you control the full stack.
- Run the same eval on multiple checkpoints without API costs or rate limits
- Ablate specific weights and layers to understand failure modes
- Modify prompts, system instructions, and sampling parameters programmatically
- Deploy your eval infrastructure alongside production, continuously
Proprietary APIs have their place. But for rigorous evaluation, open weights are unbeatable.
A practical starting point
If you are not running custom evals yet, start here:
- Collect 100 representative real inputs from your system
- Define “correct” for each (exact match, human-rated, or rubric)
- Run them through your current model and score
- Run them through two alternatives
- Measure latency, cost, and failure rate alongside accuracy
That is already more useful than any leaderboard score.