We shipped our tenth production LoRA fine-tuning pipeline last month.
The pattern never changes: teams spend weeks debating architecture, then throw garbage data at the model and wonder why it hallucinates.
Here is what actually matters.
1. Your data is worse than you think
We have never seen a “ready” dataset.
Before any training run, we spend 60-80% of the project on data work:
- Deduplication — Near-duplicates silently poison your loss curves
- Quality filtering — Outliers and formatting errors amplify during training
- Distribution analysis — Biased data produces biased models, end of story
- Synthetic augmentation — When done right, 10% synthetic + 90% real beats 100% real. When done wrong, it is poison.
Data curation is not preprocessing. It is the core engineering task.
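Most of that work is mechanical once you commit to it. A minimal sketch of a dedup and quality-filter pass in plain Python (shingle size, thresholds, and the minimum-length cutoff are illustrative; at scale the pairwise Jaccard check below would be replaced by MinHash/LSH):

```python
import re
from typing import Iterable

SHINGLE_SIZE = 5          # word n-gram size for near-dup detection (illustrative)
NEAR_DUP_THRESHOLD = 0.8  # Jaccard similarity above this counts as a near-duplicate

def normalize(text: str) -> str:
    # Lowercase and collapse whitespace so formatting differences don't hide duplicates.
    return re.sub(r"\s+", " ", text.lower()).strip()

def shingles(text: str, n: int = SHINGLE_SIZE) -> set:
    words = text.split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

def dedup_and_filter(docs: Iterable[str], min_words: int = 20) -> list[str]:
    kept, seen_exact, kept_shingles = [], set(), []
    for doc in docs:
        norm = normalize(doc)
        # Quality filter: drop fragments too short to carry a training signal.
        if len(norm.split()) < min_words:
            continue
        # Exact-duplicate check on the normalized text.
        h = hash(norm)
        if h in seen_exact:
            continue
        # Near-duplicate check: O(n^2) pairwise Jaccard, fine for a sketch;
        # production pipelines use MinHash/LSH for this step.
        sh = shingles(norm)
        if any(jaccard(sh, other) >= NEAR_DUP_THRESHOLD for other in kept_shingles):
            continue
        seen_exact.add(h)
        kept_shingles.append(sh)
        kept.append(doc)
    return kept
```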
2. Architecture decisions matter less than eval decisions
Every team asks: “LoRA or full fine-tune? QLoRA? What rank?”
These are optimization questions, not success questions.
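For scale: the adapter setup itself is a handful of config lines you can revisit on any later run. A minimal sketch with Hugging Face peft and transformers (base model, rank, and target modules are illustrative, not a recommendation):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

BASE = "meta-llama/Llama-3.1-8B-Instruct"  # illustrative base model

# Optional 4-bit quantization (the "QLoRA" variant of the same question);
# requires the bitsandbytes package.
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",
                         bnb_4bit_compute_dtype=torch.bfloat16)

model = AutoModelForCausalLM.from_pretrained(BASE, quantization_config=bnb,
                                             device_map="auto")

lora = LoraConfig(
    r=16,                     # rank: the knob everyone debates
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # typically well under 1% of weights are trainable
```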
The question that really matters: “How will we know if this model is better?”
We build custom eval harnesses before the first training run. Not standard benchmarks — custom ones tied to the actual task. If we cannot measure it, we cannot improve it.
Our standard pipeline: hold-out task-specific evals, regression checks against the base model, and human-in-the-loop spot checks. Anything less is guessing.
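A minimal sketch of that harness shape, assuming each model is wrapped in a generate(prompt) -> str callable and each case carries its own task-specific check (both are hypothetical names, not a library API):

```python
import re
from typing import Callable

# One eval case: a prompt plus a task-specific pass/fail check on the output.
EvalCase = tuple[str, Callable[[str], bool]]

def run_evals(generate: Callable[[str], str], cases: list[EvalCase]) -> float:
    passed = sum(check(generate(prompt)) for prompt, check in cases)
    return passed / len(cases)

def regression_report(base_generate: Callable[[str], str],
                      tuned_generate: Callable[[str], str],
                      cases: list[EvalCase]) -> dict:
    # Score fine-tuned vs. base model on the same hold-out set;
    # a tuned model that loses to its own base is a red flag before deployment.
    base, tuned = run_evals(base_generate, cases), run_evals(tuned_generate, cases)
    return {"base": base, "tuned": tuned, "delta": tuned - base,
            "regression": tuned < base}

# Example task-specific check: the answer must contain the invoice total.
cases: list[EvalCase] = [
    ("Extract the invoice total from: 'Total due: $1,250.00'",
     lambda out: bool(re.search(r"\$?1,?250(\.00)?", out))),
]
```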
3. Open-weight is not a compromise — it is a capability
We run Llama, Mistral, Qwen, and DeepSeek in production. Not because clients can’t afford APIs. Because open weights give you:
- Cost transparency — No per-token surprises at scale
- Latency control — Inference on your hardware, your network
- Data sovereignty — No data leaves your environment
- Iteration speed — Train, test, deploy without vendor approval cycles
The “open vs. closed” debate is a false binary. We use both. The right model for the job, not ideology.
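For reference, once the weights are local, inference on your own hardware is a short script. A minimal sketch using vLLM (the model name and sampling settings are illustrative):

```python
from vllm import LLM, SamplingParams

# Weights load from a local path or the local cache; nothing leaves the box.
llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.3")  # illustrative open-weight model

params = SamplingParams(temperature=0.2, max_tokens=256)
outputs = llm.generate(["Summarize this support ticket: ..."], params)
print(outputs[0].outputs[0].text)
```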
The single pattern
Every successful fine-tuning project follows the same sequence:
- Define the task precisely (what does “better” mean?)
- Build the eval harness first
- Curate the data obsessively
- Train with disciplined logging (run-manifest sketch after this list)
- Evaluate against baseline + regressions
- Deploy with monitoring
- Iterate
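On the logging step: a simple pattern is to write a per-run manifest that ties config, data snapshot, and eval scores together, so any later regression can be traced to a specific run. A minimal sketch (field names are illustrative):

```python
import hashlib
import json
import time
from pathlib import Path

def log_run(config: dict, data_path: str, eval_scores: dict, out_dir: str = "runs") -> Path:
    # Hash the training file so the exact data snapshot is pinned to the run.
    data_hash = hashlib.sha256(Path(data_path).read_bytes()).hexdigest()[:12]
    run_id = time.strftime("%Y%m%d-%H%M%S")
    manifest = {
        "run_id": run_id,
        "config": config,        # LoRA rank, lr, epochs, base model, ...
        "data_sha256": data_hash,
        "eval": eval_scores,     # eval harness output, incl. delta vs. baseline
    }
    path = Path(out_dir) / f"run_{run_id}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(manifest, indent=2))
    return path
```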
Skip any step and you are gambling.