Why we benchmark on custom evals, not standard leaderboards

We benchmarked six models for a client last quarter.

The leaderboard winner was their worst performer in production.

Here is why leaderboard scores lie — and how to build evals that don’t.

The leaderboard trap

Standard benchmarks (MMLU, HumanEval, GSM8K) are useful for one thing: a rough signal of general capability.

They are useless for production decisions because:

They test academic tasks, not your tasks. A model that crushes MMLU may hallucinate on your specific domain.
They measure average, not variance. A model with a great mean and terrible tail is a production incident waiting to happen.
They do not test safety characteristics. Jailbreak resistance, prompt injection handling, and data leakage are invisible on leaderboards.
They are gamed. Training on test data is common enough to be assumed.

We use leaderboards for initial filtering. Never for final selection.

For every client engagement, we build a custom eval suite with three layers:

A model that is 5% less accurate but 3x cheaper and 10x more robust wins.

With open-weight models (Llama, Mistral, Qwen), you control the full stack.

Proprietary APIs have their place. But for rigorous evaluation, open weights are unbeatable.

If you are not running custom evals yet, start here:

That is already more useful than any leaderboard score.