Insight

Why we benchmark on custom evals, not standard leaderboards

Originally posted on X →

We benchmarked six models for a client last quarter.

The leaderboard winner was their worst performer in production.

Here is why leaderboard scores lie — and how to build evals that don’t.


The leaderboard trap

Standard benchmarks (MMLU, HumanEval, GSM8K) are useful for one thing: a rough signal of general capability.

They are useless for production decisions because:

  • They test academic tasks, not your tasks. A model that crushes MMLU may hallucinate on your specific domain.
  • They measure average, not variance. A model with a great mean and terrible tail is a production incident waiting to happen.
  • They do not test safety characteristics. Jailbreak resistance, prompt injection handling, and data leakage are invisible on leaderboards.
  • They are gamed. Training on test data is common enough to be assumed.

We use leaderboards for initial filtering. Never for final selection.


What a real eval looks like

For every client engagement, we build a custom eval suite with three layers, each sketched in code below:

Task-specific accuracy

  • Sample from actual production inputs
  • Measure exact match, semantic similarity, or human ratings, depending on the task
  • Include edge cases and adversarial examples from real user behavior
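
A minimal sketch of that first layer, assuming a cases.jsonl file of sampled production inputs and a call_model function you supply for the model under test (both are placeholders, not a real API):

```python
import json

def load_cases(path: str) -> list[dict]:
    # Each line: {"input": "...", "expected": "..."} sampled from production.
    with open(path) as f:
        return [json.loads(line) for line in f]

def exact_match(output: str, expected: str) -> bool:
    # Normalize whitespace and case; swap in semantic similarity or a
    # human-rated rubric where exact match is too strict for the task.
    return output.strip().lower() == expected.strip().lower()

def accuracy(cases: list[dict], call_model) -> float:
    hits = sum(exact_match(call_model(c["input"]), c["expected"]) for c in cases)
    return hits / len(cases)

# accuracy(load_cases("cases.jsonl"), call_model)
```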

Regression and safety

  • Compare against baseline models on held-out prompts
  • Red-team for jailbreaks, prompt injection, and data leakage
  • Measure refusal robustness and toxicity on domain-relevant inputs
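
A rough sketch of the safety layer, with the same placeholder convention (attacks.jsonl, the call_* functions). Keyword-based refusal detection is deliberately crude; use a classifier or human review for anything that matters:

```python
import json

REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "not able to")

def refused(output: str) -> bool:
    # Crude heuristic: any refusal phrase counts as a refusal.
    return any(marker in output.lower() for marker in REFUSAL_MARKERS)

def refusal_rate(path: str, call_model) -> float:
    with open(path) as f:
        prompts = [json.loads(line)["prompt"] for line in f]
    return sum(refused(call_model(p)) for p in prompts) / len(prompts)

# On jailbreak prompts, a candidate that refuses less than the baseline
# is a regression, however good its accuracy numbers look:
# refusal_rate("attacks.jsonl", call_candidate)
# refusal_rate("attacks.jsonl", call_baseline)
```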

Operational metrics

  • Latency at P50, P95, P99
  • Token cost per request
  • Failure rate and retry patterns
  • Memory usage under load
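
And a sketch of the operational layer. Again, call_model and the request list are stand-ins; in practice you would also log token counts for cost and tag retries:

```python
import statistics
import time

def latency_profile(requests: list, call_model) -> dict:
    latencies, failures = [], 0
    for req in requests:
        start = time.perf_counter()
        try:
            call_model(req)
        except Exception:
            failures += 1
            continue
        latencies.append(time.perf_counter() - start)
    cuts = statistics.quantiles(latencies, n=100)  # 99 percentile cut points
    return {
        "p50": cuts[49], "p95": cuts[94], "p99": cuts[98],
        "failure_rate": failures / len(requests),
    }
```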

A model that is 5% less accurate but 3x cheaper and 10x more robust wins.
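
To make that trade-off explicit rather than a gut call, here is a toy composite score. The weights and the example numbers are invented; set them from your own product priorities:

```python
def composite(accuracy: float, relative_cost: float, failure_rate: float,
              w_acc: float = 0.5, w_cost: float = 0.3, w_robust: float = 0.2) -> float:
    # relative_cost = this model's cost / the cheapest candidate's cost,
    # so the cheapest model scores 1.0 on the cost axis.
    return w_acc * accuracy + w_cost / relative_cost + w_robust * (1 - failure_rate)

# composite(0.90, 3.0, 0.05)   # accurate but pricey and flaky -> 0.74
# composite(0.85, 1.0, 0.005)  # cheaper and more robust       -> 0.92
```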


The open-weight advantage in evals

With open-weight models (Llama, Mistral, Qwen), you control the full stack.

  • Run the same eval on multiple checkpoints without API costs or rate limits (see the sketch after this list)
  • Ablate specific weights and layers to understand failure modes
  • Modify prompts, system instructions, and sampling parameters programmatically
  • Deploy your eval infrastructure alongside production, continuously
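
For instance, a checkpoint sweep with Hugging Face transformers, reusing the accuracy and load_cases helpers from the first sketch (the checkpoint paths are hypothetical):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

def generate(model, tokenizer, prompt: str) -> str:
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
    # Decode only the newly generated tokens, not the prompt.
    return tokenizer.decode(out[0][inputs["input_ids"].shape[1]:],
                            skip_special_tokens=True)

def sweep(checkpoint_paths: list[str], cases: list[dict]) -> None:
    for path in checkpoint_paths:
        tokenizer = AutoTokenizer.from_pretrained(path)
        model = AutoModelForCausalLM.from_pretrained(path, device_map="auto")
        score = accuracy(cases, lambda p: generate(model, tokenizer, p))
        print(f"{path}: {score:.2%}")

# sweep(["./ckpt-1000", "./ckpt-2000"], load_cases("cases.jsonl"))
```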

Proprietary APIs have their place. But for rigorous evaluation, open weights are unbeatable.


A practical starting point

If you are not running custom evals yet, start here (the whole loop is sketched after the steps):

  1. Collect 100 representative real inputs from your system
  2. Define “correct” for each (exact match, human rating, or a rubric)
  3. Run them through your current model and score
  4. Run them through two alternatives
  5. Measure latency, cost, and failure rate alongside accuracy
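
A minimal sketch of those five steps as one script, with placeholder call_* functions for your current model and the two alternatives:

```python
import json
import time

def evaluate(name: str, call_model, cases: list[dict]) -> None:
    correct, latencies, failures = 0, [], 0
    for case in cases:
        start = time.perf_counter()
        try:
            output = call_model(case["input"])
        except Exception:
            failures += 1
            continue
        latencies.append(time.perf_counter() - start)
        correct += output.strip() == case["expected"].strip()
    latencies.sort()
    print(f"{name}: accuracy={correct / len(cases):.0%} "
          f"p50_latency={latencies[len(latencies) // 2]:.2f}s "
          f"failure_rate={failures / len(cases):.1%}")

# with open("inputs.jsonl") as f:   # your 100 representative real inputs
#     cases = [json.loads(line) for line in f]
# for name, fn in [("current", call_current),   # placeholders: wire up
#                  ("alt_a", call_alt_a),       # your current model and
#                  ("alt_b", call_alt_b)]:      # two alternatives here
#     evaluate(name, fn, cases)
```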

That is already more useful than any leaderboard score.