Demo agents are easy.
Production agents are where you discover every assumption you did not know you were making.
We shipped three agentic systems this quarter — tool-use, multi-step planning, and memory-augmented. Here is what broke, and how we fixed it.
1. Tool calling is not deterministic
Your model will hallucinate tools. It will hallucinate arguments. It will call tools with malformed JSON and then argue with itself about whether the tool exists.
We learned to wrap every tool call in several layers of defense (sketched below):
- Schema validation — Reject anything that does not match the expected signature
- Retry with backoff — Not all failures are final, but unbounded retries are just expensive loops
- Circuit breakers — After N consecutive failures, degrade or stop, don’t loop forever
- Structured output (Outlines, JSON mode) — Constrain the model to valid output formats
The LLM is a probabilistic system. Treat it like one.
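Concretely, the wrapper looks something like this sketch. The tool registry, the argument schemas, and the `call_model` callable are hypothetical stand-ins for whatever client and tools you actually use; the point is the layering, not the specific model.

```python
import json
import time

# Hypothetical tool registry: tool name -> required argument names.
TOOL_REGISTRY = {
    "search_orders": {"customer_id", "date_range"},
    "refund_order": {"order_id", "amount"},
}

MAX_RETRIES = 3
CIRCUIT_THRESHOLD = 5        # consecutive failures before we stop calling this path
_consecutive_failures = 0


def validate_tool_call(raw: str) -> dict:
    """Schema validation: reject malformed JSON, unknown tools, missing arguments."""
    call = json.loads(raw)                                 # raises on malformed JSON
    name, args = call["name"], call.get("arguments", {})
    if name not in TOOL_REGISTRY:
        raise ValueError(f"unknown tool: {name}")          # hallucinated tool
    missing = TOOL_REGISTRY[name] - set(args)
    if missing:
        raise ValueError(f"missing arguments: {missing}")  # hallucinated signature
    return call


def guarded_tool_call(call_model, prompt: str):
    """Retry with backoff, bounded by a circuit breaker. `call_model` is your LLM client."""
    global _consecutive_failures
    if _consecutive_failures >= CIRCUIT_THRESHOLD:
        return None                              # degrade instead of looping forever
    for attempt in range(MAX_RETRIES):
        try:
            call = validate_tool_call(call_model(prompt))
            _consecutive_failures = 0
            return call
        except (json.JSONDecodeError, KeyError, ValueError):
            time.sleep(2 ** attempt)             # exponential backoff: 1s, 2s, 4s
    _consecutive_failures += 1
    return None
```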
2. Planning falls apart under ambiguity
Chain-of-thought demos feel magical. In production, users ask ambiguous questions, tools return conflicting results, and the model confidently plans nonsense.
We now design every agent with the following guard rails (a minimal loop is sketched below):
- Explicit uncertainty paths — “If the tool returns nothing, ask the user” rather than hallucinating
- Step limits — No agent gets more than N reasoning steps before escalating
- Human-in-the-loop gates — For high-stakes actions, require confirmation, not inference
- State introspection — Log the agent’s reasoning so when it fails, you can see why
If your agent cannot explain what it is doing, you cannot debug it when it fails.
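Here is a minimal sketch of a step-limited loop along those lines. The planner, tool runner, confirmation prompt, and logger are injected as plain callables, and the tool names are hypothetical; only the control flow matters.

```python
MAX_STEPS = 8                                        # hard step limit before escalation
HIGH_STAKES = {"refund_order", "delete_account"}     # hypothetical high-stakes actions


def run_agent(task, plan_next_step, run_tool, ask_user, log):
    """plan_next_step / run_tool / ask_user / log are supplied by the caller."""
    state = {"task": task, "history": []}
    for step in range(MAX_STEPS):
        action = plan_next_step(state)               # model proposes the next action
        log({"step": step, "action": action})        # state introspection: record reasoning

        if action["type"] == "finish":
            return action["answer"]

        if action["tool"] in HIGH_STAKES and not ask_user(f"Confirm: {action}?"):
            return "Action cancelled by user."       # human-in-the-loop gate

        result = run_tool(action)
        if result is None:                           # explicit uncertainty path
            return ask_user("The tool returned nothing. Can you clarify what you need?")

        state["history"].append((action, result))

    return "Step limit reached; escalating to a human."   # no unbounded loops
```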
3. Memory is harder than it looks
“We will just store the conversation history.”
That works until:
- Context windows fill up (even 128k tokens is surprisingly limiting)
- Irrelevant history overwrites relevant context
- Sensitive data leaks in retrieved memories
- Multiple users’ histories get mixed in multi-tenant deployments
We use a hybrid approach (sketched below):
- Recent turns in the context window (a sliding window with summarization)
- Structured memory store (key facts, user preferences, task state)
- Retrieval-augmented memory (semantic search over conversation history with access controls)
- Explicit forgetting — Users should be able to delete memories, so build deletion paths from day one
Memory is a database problem, not an LLM problem.
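A rough shape of that hybrid layout, as a sketch. The `summarize` callable and the `vector_store` object with `add`/`search`/`delete` methods are assumed interfaces, not a specific library's API.

```python
from collections import deque


class AgentMemory:
    """Hybrid memory sketch: sliding window + summary + structured facts + retrieval."""

    def __init__(self, summarize, vector_store, max_recent=10):
        self.recent = deque(maxlen=max_recent)   # sliding window of recent turns
        self.summary = ""                        # rolling summary of evicted turns
        self.facts = {}                          # structured store: preferences, task state
        self.summarize = summarize
        self.vector_store = vector_store         # semantic search with access controls

    def add_turn(self, user_id, turn):
        if len(self.recent) == self.recent.maxlen:
            self.summary = self.summarize(self.summary, self.recent[0])  # fold out oldest turn
        self.recent.append(turn)
        self.vector_store.add(turn, metadata={"user_id": user_id})       # tag by tenant

    def remember(self, key, value):
        self.facts[key] = value

    def retrieve(self, user_id, query):
        # Access control at retrieval time: never mix tenants' histories.
        return self.vector_store.search(query, filter={"user_id": user_id})

    def forget(self, user_id):
        # Deletion path from day one: wipe everything tied to this user.
        self.recent.clear()
        self.summary = ""
        self.facts.clear()
        self.vector_store.delete(filter={"user_id": user_id})
```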
The stack that works
For our agentic deployments, we run a mix of components:
- Open-weight models for tool-use and reasoning (Llama 3.3, Mistral Large, Qwen 2.5)
- Proprietary APIs for tasks where frontier reasoning outperforms (long-horizon planning, complex analysis)
- Outlines / structured generation for deterministic tool calls
- LangChain/LangGraph or custom orchestration — depends on complexity, not ideology
- Prometheus/Grafana + structured logging for observability
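As an illustration of that last bullet, here is roughly how a tool function can be wrapped so every call emits Prometheus metrics and one structured log line. The metric names and labels are illustrative, not a standard.

```python
import json
import logging
import time

from prometheus_client import Counter, Histogram, start_http_server

TOOL_CALLS = Counter("agent_tool_calls_total", "Tool calls by outcome", ["tool", "status"])
TOOL_LATENCY = Histogram("agent_tool_latency_seconds", "Tool call latency", ["tool"])
log = logging.getLogger("agent")


def instrumented_call(tool_name, fn, **kwargs):
    """Wrap any tool function so every call emits metrics and one structured log line."""
    start = time.monotonic()
    status = "error"
    try:
        result = fn(**kwargs)
        status = "ok"
        return result
    finally:
        elapsed = time.monotonic() - start
        TOOL_CALLS.labels(tool=tool_name, status=status).inc()
        TOOL_LATENCY.labels(tool=tool_name).observe(elapsed)
        log.info(json.dumps({"tool": tool_name, "status": status, "latency_s": round(elapsed, 3)}))


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    start_http_server(8000)   # expose /metrics for Prometheus to scrape
    instrumented_call("echo", lambda text: text, text="hello")
```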
The lesson: there is no universal agent stack. The right architecture depends on your latency constraints, cost budget, and failure tolerance.
What we would do differently
- Start with evals, not architecture. Define what “good” looks like before you build (even a harness as small as the sketch below helps).
- Assume failure. Every tool call, every planning step, every memory retrieval can fail. Design graceful degradation.
- Measure in production. Synthetic evals catch obvious bugs. Real users find the interesting ones.
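On that first point: the eval harness does not need to be sophisticated to be useful. Something as small as this sketch, with tasks and checks of your own, makes “good” concrete before any architecture exists. The cases here are hypothetical.

```python
EVAL_CASES = [
    {"task": "Find the status of order 1234",
     "check": lambda out: "1234" in out},
    {"task": "What is the refund policy?",
     "check": lambda out: "refund" in out.lower()},
    {"task": "Cancel my account",                    # high stakes: should ask, not act
     "check": lambda out: "confirm" in out.lower()},
]


def run_evals(agent):
    """`agent` is any callable that takes a task string and returns a string."""
    passed = 0
    for case in EVAL_CASES:
        ok = case["check"](agent(case["task"]))
        passed += ok
        print(f"{'PASS' if ok else 'FAIL'}: {case['task']}")
    print(f"{passed}/{len(EVAL_CASES)} passed")
```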