Demo agents are easy.
Production agents are where you discover every assumption you did not know you were making.
We shipped three agentic systems this quarter — tool-use, multi-step planning, and memory-augmented. Here is what broke, and how we fixed it.
1. Tool calling is not deterministic
Your model will hallucinate tools. It will hallucinate arguments. It will call tools with malformed JSON and then argue with itself about whether the tool exists.
We learned to wrap every tool call in several layers of defense (sketched below):
- Schema validation — Reject anything that does not match the expected signature
- Retry with backoff — Not all failures are final, but unbounded retries are just expensive loops
- Circuit breakers — After N consecutive failures, degrade or stop, don’t loop forever
- Structured output (Outlines, JSON mode) — Constrain the model to valid output formats
The LLM is a probabilistic system. Treat it like one.
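Concretely, the wrapper looks something like this sketch. The tool registry, the argument schemas, and the `call_model` callable are hypothetical stand-ins for whatever client and tools you actually use; the point is the layering, not the specific model.

```python
import json
import time

# Hypothetical tool registry: tool name -> required argument names.
TOOL_REGISTRY = {
    "search_orders": {"customer_id", "date_range"},
    "refund_order": {"order_id", "amount"},
}

MAX_RETRIES = 3
CIRCUIT_THRESHOLD = 5        # consecutive failures before we stop calling this path
_consecutive_failures = 0


def validate_tool_call(raw: str) -> dict:
    """Schema validation: reject malformed JSON, unknown tools, missing arguments."""
    call = json.loads(raw)                                 # raises on malformed JSON
    name, args = call["name"], call.get("arguments", {})
    if name not in TOOL_REGISTRY:
        raise ValueError(f"unknown tool: {name}")          # hallucinated tool
    missing = TOOL_REGISTRY[name] - set(args)
    if missing:
        raise ValueError(f"missing arguments: {missing}")  # hallucinated signature
    return call


def guarded_tool_call(call_model, prompt: str):
    """Retry with backoff, bounded by a circuit breaker. `call_model` is your LLM client."""
    global _consecutive_failures
    if _consecutive_failures >= CIRCUIT_THRESHOLD:
        return None                              # degrade instead of looping forever
    for attempt in range(MAX_RETRIES):
        try:
            call = validate_tool_call(call_model(prompt))
            _consecutive_failures = 0
            return call
        except (json.JSONDecodeError, KeyError, ValueError):
            time.sleep(2 ** attempt)             # exponential backoff: 1s, 2s, 4s
    _consecutive_failures += 1
    return None
```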
2. Planning falls apart under ambiguity
Chain-of-thought demos feel magical. In production, users ask ambiguous questions, tools return conflicting results, and the model confidently plans nonsense.
We now design every agent with the following guard rails (a minimal loop is sketched below):
- Explicit uncertainty paths — “If the tool returns nothing, ask the user” rather than hallucinating
- Step limits — No agent gets more than N reasoning steps before escalating
- Human-in-the-loop gates — For high-stakes actions, require confirmation, not inference
- State introspection — Log the agent’s reasoning so when it fails, you can see why
If your agent cannot explain what it is doing, you cannot debug it when it fails.
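Here is a minimal sketch of a step-limited loop along those lines. The planner, tool runner, confirmation prompt, and logger are injected as plain callables, and the tool names are hypothetical; only the control flow matters.

```python
MAX_STEPS = 8                                        # hard step limit before escalation
HIGH_STAKES = {"refund_order", "delete_account"}     # hypothetical high-stakes actions


def run_agent(task, plan_next_step, run_tool, ask_user, log):
    """plan_next_step / run_tool / ask_user / log are supplied by the caller."""
    state = {"task": task, "history": []}
    for step in range(MAX_STEPS):
        action = plan_next_step(state)               # model proposes the next action
        log({"step": step, "action": action})        # state introspection: record reasoning

        if action["type"] == "finish":
            return action["answer"]

        if action["tool"] in HIGH_STAKES and not ask_user(f"Confirm: {action}?"):
            return "Action cancelled by user."       # human-in-the-loop gate

        result = run_tool(action)
        if result is None:                           # explicit uncertainty path
            return ask_user("The tool returned nothing. Can you clarify what you need?")

        state["history"].append((action, result))

    return "Step limit reached; escalating to a human."   # no unbounded loops
```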
3. Memory is harder than it looks
“We will just store the conversation history.”
That works until:
- Context windows fill up (even 128k tokens is surprisingly limiting)
- Irrelevant history overwrites relevant context
- Sensitive data leaks in retrieved memories
- Multiple users’ histories get mixed in multi-tenant deployments
We use a hybrid approach (sketched below):
- Recent turns in the context window (a sliding window with summarization)
- Structured memory store (key facts, user preferences, task state)
- Retrieval-augmented memory (semantic search over conversation history with access controls)
- Explicit forgetting — Users should be able to delete memories, so build deletion paths from day one
Memory is a database problem, not an LLM problem.
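A rough shape of that hybrid layout, as a sketch. The `summarize` callable and the `vector_store` object with `add`/`search`/`delete` methods are assumed interfaces, not a specific library's API.

```python
from collections import deque


class AgentMemory:
    """Hybrid memory sketch: sliding window + summary + structured facts + retrieval."""

    def __init__(self, summarize, vector_store, max_recent=10):
        self.recent = deque(maxlen=max_recent)   # sliding window of recent turns
        self.summary = ""                        # rolling summary of evicted turns
        self.facts = {}                          # structured store: preferences, task state
        self.summarize = summarize
        self.vector_store = vector_store         # semantic search with access controls

    def add_turn(self, user_id, turn):
        if len(self.recent) == self.recent.maxlen:
            self.summary = self.summarize(self.summary, self.recent[0])  # fold out oldest turn
        self.recent.append(turn)
        self.vector_store.add(turn, metadata={"user_id": user_id})       # tag by tenant

    def remember(self, key, value):
        self.facts[key] = value

    def retrieve(self, user_id, query):
        # Access control at retrieval time: never mix tenants' histories.
        return self.vector_store.search(query, filter={"user_id": user_id})

    def forget(self, user_id):
        # Deletion path from day one: wipe everything tied to this user.
        self.recent.clear()
        self.summary = ""
        self.facts.clear()
        self.vector_store.delete(filter={"user_id": user_id})
```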
The stack that works
For our agentic deployments, we run a mix of components:
- Open-weight models for tool-use and reasoning (Llama 3.3, Mistral Large, Qwen 2.5)
- Proprietary APIs for tasks where frontier reasoning outperforms (long-horizon planning, complex analysis)
- Outlines / structured generation for deterministic tool calls
- LangChain/LangGraph or custom orchestration — depends on complexity, not ideology
- Prometheus/Grafana + structured logging for observability
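As an illustration of that last bullet, here is roughly how a tool function can be wrapped so every call emits Prometheus metrics and one structured log line. The metric names and labels are illustrative, not a standard.

```python
import json
import logging
import time

from prometheus_client import Counter, Histogram, start_http_server

TOOL_CALLS = Counter("agent_tool_calls_total", "Tool calls by outcome", ["tool", "status"])
TOOL_LATENCY = Histogram("agent_tool_latency_seconds", "Tool call latency", ["tool"])
log = logging.getLogger("agent")


def instrumented_call(tool_name, fn, **kwargs):
    """Wrap any tool function so every call emits metrics and one structured log line."""
    start = time.monotonic()
    status = "error"
    try:
        result = fn(**kwargs)
        status = "ok"
        return result
    finally:
        elapsed = time.monotonic() - start
        TOOL_CALLS.labels(tool=tool_name, status=status).inc()
        TOOL_LATENCY.labels(tool=tool_name).observe(elapsed)
        log.info(json.dumps({"tool": tool_name, "status": status, "latency_s": round(elapsed, 3)}))


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    start_http_server(8000)   # expose /metrics for Prometheus to scrape
    instrumented_call("echo", lambda text: text, text="hello")
```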
The lesson: there is no universal agent stack. The right architecture depends on your latency constraints, cost budget, and failure tolerance.
What we would do differently
- Start with evals, not architecture. Define what “good” looks like before you build (even a harness as small as the sketch below helps).
- Assume failure. Every tool call, every planning step, every memory retrieval can fail. Design graceful degradation.
- Measure in production. Synthetic evals catch obvious bugs. Real users find the interesting ones.
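On that first point: the eval harness does not need to be sophisticated to be useful. Something as small as this sketch, with tasks and checks of your own, makes “good” concrete before any architecture exists. The cases here are hypothetical.

```python
EVAL_CASES = [
    {"task": "Find the status of order 1234",
     "check": lambda out: "1234" in out},
    {"task": "What is the refund policy?",
     "check": lambda out: "refund" in out.lower()},
    {"task": "Cancel my account",                    # high stakes: should ask, not act
     "check": lambda out: "confirm" in out.lower()},
]


def run_evals(agent):
    """`agent` is any callable that takes a task string and returns a string."""
    passed = 0
    for case in EVAL_CASES:
        ok = case["check"](agent(case["task"]))
        passed += ok
        print(f"{'PASS' if ok else 'FAIL'}: {case['task']}")
    print(f"{passed}/{len(EVAL_CASES)} passed")
```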