Three Layers of AI Engineering — Blog 4: Harness Engineering — The Loop Around the Model#
In Post 2, the Contract Analyzer prompt got the system to 98% reliability. In Post 3, context engineering closed the long-document failure mode and dropped cost ~60% without touching the prompt. The two remaining failures from the running scenario — the 2% conversational preamble, and the inability to reason across clauses — don't live on either of those layers. They live in the harness.
The harness is everything that wraps the model call: the parameters that govern invocation, the tools the model can reach for, the sub-agents you fan out to, the hooks that enforce stop conditions, the memory that persists across turns, the parsers that turn probabilistic output into deterministic data, and the observability that tells you what actually happened. A single LLM call has no harness — you send the prompt, you get the response, you're done. An agent is, almost by definition, an LLM plus a harness.
This post walks through the harness layer using two of my projects as worked examples: the parallel sub-agent dispatch from Deep Research Agent, and the OpenTelemetry tracing scheme from Agent Observability. By the end, the Same Query box has its third column filled in, and the running scenario is fully solved.
The Three Layers of AI Engineering Series#
| Part | Title | Focus |
|---|---|---|
| 1 | Why I Stopped Calling It Prompt Engineering | The three-layer thesis, the running scenario, the decision question |
| 2 | Prompt Engineering — What's Still in Scope | System-as-contract, schema-first, few-shot, negative space |
| 3 | Context Engineering — Curating What the Model Sees | Complexity, compression, dedup, memory, lost-in-the-middle |
| 4 | Harness Engineering — The Loop Around the Model (this post) | Tools, sub-agents, hooks, schedulers, parsers, memory, observability |
| 5 | Picking the Right Layer: A Decision Framework | Diagnostic flowchart, iteration cost per layer, case studies across five projects |
What "Harness" Means in This Post#
A definition narrow enough to work with:
The harness is the runtime code that surrounds the model call — the parameters, dispatch, sub-agents, parsers, hooks, schedulers, memory, and instrumentation that turn a probabilistic text generator into a system you can build on.
What's in scope:
- The agent loop itself (the gather-context → act → observe → repeat cycle)
- Model invocation parameters (max_tokens, temperature, top_p, retries, timeouts, thinking budget, batch vs realtime, cache breakpoints)
- The SDK or framework that implements your loop (Claude Agent SDK, OpenAI Agents SDK, LangGraph, Mastra, Pydantic AI, etc.)
- Tool definitions, tool selection, tool dispatch — increasingly via MCP servers rather than ad-hoc per-app definitions
- Sub-agent decomposition and topology (orchestrator-worker, swarm/handoffs, hierarchical, network/A2A)
- Output parsers and validators stacked into L1/L2/L3 (Pydantic / Instructor retry-with-feedback / constrained decoding)
- Hooks and lifecycle events (PreToolUse, PostToolUse, UserPromptSubmit, Stop, SubagentStop, PreCompact)
- Permissions, allow/deny-lists, sandboxing (the "model proposes, harness disposes" pattern)
- Memory persistence between turns — Mem0, Letta, Zep, LangMem; storage backend is harness, what's injected is context
- Durable execution for long-horizon agents (Temporal, Inngest, AWS Lambda Durable Functions)
- Skills as discoverable, lazy-loaded capability packages
- Streaming events (text_delta, thinking_delta, tool_use blocks), keepalives, backpressure
- Cost tracking and budget-aware routing (LiteLLM, Bifrost, Helicone)
- OpenTelemetry GenAI tracing (gen_ai.* semantic conventions), Langfuse / Phoenix / Braintrust
- Evaluation: LLM-as-judge in the loop, regression suites in CI
What's not in scope:
- The instruction string → prompt layer (Post 2)
- What enters the context window before invocation → context layer (Post 3)
The boundary is sharper than people usually treat it. A retry policy is a harness concern. What you say to the model on the retry is a prompt concern. A sub-agent dispatch is a harness concern. What instruction each sub-agent receives is a prompt concern. Most agent bugs that "feel like a prompt problem" are actually harness bugs misdiagnosed because the harness is invisible.
One more boundary worth marking: persistent memory is a harness concern (the storage, retrieval, and write-policy code lives outside the model call), but which retrieved memory item gets injected into the prompt is a context-engineering decision. Mem0, Letta, Zep, LangMem are harness components. The decision "this fact, not that one, goes in the window" is context engineering. Conflating the two is how teams end up with great memory infrastructure that surfaces the wrong information to the model.
The Loop Itself#
Before any of the techniques in this post make sense, the loop they wrap has to be named.
The canonical agent loop, the way every modern SDK (Claude Agent SDK, OpenAI Agents SDK, LangGraph, Mastra) implements it:
    loop:
        if max_iterations exceeded OR cost ceiling reached OR no progress:
            break with whatever we have
        context = gather_context(history, memory, retrieval, tools)
        response = model(context)
        if response is a final answer:
            return response
        if response is a tool call:
            result = dispatch_tool(response.tool, response.args)
            history.append(response, result)
            continue loop
That's it. Roughly ten lines of pseudocode that distinguish a chain (one model call, done) from an agent (a loop of model calls, each one informed by the results of the last). Every technique in this post is scaffolding around that loop.
Three things about the loop deserve calling out, because they're where most production agents break:
1. The terminal condition is not a prompt concern. The model is not deciding when to stop; the harness is. The model can suggest "I think I'm done" via a tool call or a structured output verb, but the harness reads that verb and breaks the loop. Three terminal conditions you need, regardless of SDK:
- A hard iteration cap (typically 10-30 steps; anything more usually means the loop is broken)
- A cost ceiling (in dollars, in tokens, or in tool-call count)
- A no-progress detector (if the last 3 iterations produced the same tool call with the same args, you're in a loop the model can't break out of)
The third one is the silent killer in production. The first two are easy; the third requires harness state.
2. "Gather context" is itself a context-engineering decision — what enters the window on each iteration, how old turns get compacted (see Post 3), which tool definitions get included, what memory gets retrieved. The harness owns the act of gathering; the context layer owns the curation policy.
3. The loop has to be observable. If you can't see what the model proposed, what tool it called, what the tool returned, and how the harness decided to continue or stop — at every iteration — you cannot debug a misbehaving agent. Technique 6 (tracing) is what makes this work.
The rest of this post fills in the techniques. Keep the loop in your head; the techniques are about what you wrap around each step of it.
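To make the three terminal conditions concrete, here is a minimal sketch in plain Python. It is not any SDK's actual loop: `run_agent`, `stuck_model`, the response dictionaries, and the default limits are hypothetical stand-ins chosen for illustration.

```python
import json
from collections import deque
from typing import Callable

def run_agent(model: Callable[[list], dict], tools: dict,
              max_steps: int = 20, cost_ceiling: float = 1.00,
              repeat_window: int = 3) -> dict:
    """Loop until a final answer arrives or the harness forces a stop."""
    history: list = []
    spent = 0.0
    recent_calls: deque = deque(maxlen=repeat_window)  # state for the no-progress detector

    for step in range(max_steps):                      # hard iteration cap
        response = model(history)
        spent += response.get("cost", 0.0)
        if response["type"] == "final":
            return {"status": "done", "answer": response["text"], "steps": step + 1}
        if spent >= cost_ceiling:                      # cost ceiling
            return {"status": "stopped", "reason": "cost_ceiling", "steps": step + 1}

        # Fingerprint the call so identical tool + args repeats are detectable.
        fingerprint = (response["tool"], json.dumps(response["args"], sort_keys=True))
        recent_calls.append(fingerprint)
        if len(recent_calls) == repeat_window and len(set(recent_calls)) == 1:
            return {"status": "stopped", "reason": "no_progress", "steps": step + 1}

        result = tools[response["tool"]](**response["args"])
        history.append((response, result))

    return {"status": "stopped", "reason": "max_steps", "steps": max_steps}


# Hypothetical stub model that asks for the same search forever.
def stuck_model(history: list) -> dict:
    return {"type": "tool", "tool": "search", "args": {"q": "liability cap"}, "cost": 0.01}

outcome = run_agent(stuck_model, {"search": lambda q: f"results for {q}"})
print(outcome)  # {'status': 'stopped', 'reason': 'no_progress', 'steps': 3}
```

The model never gets a vote on any of the three stops. The only state the detector needs is a short window of recent call fingerprints, which is why it has to live in the harness.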
Sources: Anthropic — Claude Agent SDK overview, Anthropic — Effective harnesses for long-running agents.
Technique 1: Decoder Parameters Are Harness, Not Prompt#
The simplest harness technique is the one most people lump into "prompt engineering" by default: the model invocation parameters.
    model = BedrockModel(
        model_id=BEDROCK_MODEL_ID,
        region_name=AWS_REGION,
        max_tokens=8192,
        temperature=0.1,
    )
Four parameters: two identifiers and two numbers. None of them are part of the prompt. All of them shape what comes back.
The max_tokens=8192 decision from Contract Analyzer is the textbook harness bug from Post 1: my first version used max_tokens=2048 by muscle memory, and the fifth test run came back with 27 clauses instead of 41 and an unparseable trailing {. The model wasn't hallucinating — the runtime had silently truncated it. The fix is a calculation: (tokens-per-item × cardinality × 1.5). For 41 clauses at ~150 tokens each, you need ~9,200 tokens of output headroom. 2048 isn't even close.
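The budget formula is small enough to write down. This is a sketch of the arithmetic only; `estimate_max_tokens` is a hypothetical helper name, and the 1.5 headroom factor is the one quoted above.

```python
import math

def estimate_max_tokens(items: int, tokens_per_item: int, headroom: float = 1.5) -> int:
    """Output budget = tokens-per-item x cardinality x headroom, rounded up."""
    return math.ceil(items * tokens_per_item * headroom)

raw_need = 41 * 150                                   # tokens with no safety margin
budget = estimate_max_tokens(items=41, tokens_per_item=150)

print(raw_need)        # 6150
print(budget)          # 9225
print(2048 < raw_need) # True: the old default could not fit the raw output
```

By the same arithmetic, 8,192 covers the 6,150-token raw need with about 1.33x headroom, a little under the full 1.5x the formula asks for, so it is worth re-running the estimate if the clause count or per-clause length grows.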
The temperature=0.1 decision is more interesting. From the scoring post: "At 0.1 for the summary call, every summary started with 'This contract presents a medium risk profile...' — every one, robotic and indistinguishable across contracts." The extraction call wants determinism (testable, cacheable, debuggable). The summary call wants prose with voice. Same model, same data, different harness parameter — different system behavior. The author of the prompt never has to know.
The general rule: every decoder parameter is a system design decision. Don't accept defaults. Estimate your output budget. Pick a temperature that matches the task (classification → 0.1; structured generation → 0.3; creative generation → 0.7). Set timeouts that match your user-facing latency budget, not the SDK's default of "forever."
Two 2026 invocation modes that belong in this section. "Calling the model" used to mean one thing: a synchronous request. In 2026, "calling the model" is at least three things, and the choice between them is harness configuration.
- Realtime invocation — what the Contract Analyzer uses. User is waiting; you pay full price; latency matters.
- Batch invocation — Anthropic's Message Batches API bundles up to 100,000 requests, returns results within 24 hours, and gives a 50% discount on input + output tokens. For any non-interactive workload (overnight contract-portfolio re-scoring, bulk eval runs, retrospective analysis), this is the default. The harness change is one method call; the cost change is half.
- Cache-aware realtime: covered in detail in Post 3 Technique 6. The harness sets cache_control breakpoints; the runtime charges 0.1x base for cached reads vs. 1x for fresh input. Cache writes cost 1.25x (5-min TTL) or 2x (1-hour TTL). A typical agent with a 30,000-token system prompt + tool defs pays about 90% less for that prefix on every call after the first one, as long as the cache stays warm.
Both batch and caching are harness primitives the agent code controls — not capabilities the model exposes. Picking the wrong invocation mode for your workload is a silent ~50% cost burn that no prompt or context change will recover.
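A quick way to sanity-check the invocation-mode choice is to run the multipliers above through a toy cost model. This sketch counts input tokens only, in "base-price token" units, and assumes every call lands inside the cache TTL. The function name, the 500 fresh tokens per call, and the 10-call session are illustrative assumptions, not measurements.

```python
def input_cost_units(prefix_tokens: int, fresh_tokens: int, calls: int, cached: bool,
                     write_mult: float = 1.25, read_mult: float = 0.1) -> float:
    """Total input cost across `calls` realtime calls, in base-price token units."""
    if not cached:
        return float(calls * (prefix_tokens + fresh_tokens))
    first = prefix_tokens * write_mult + fresh_tokens                 # one cache write
    rest = (calls - 1) * (prefix_tokens * read_mult + fresh_tokens)   # cache reads
    return first + rest

uncached = input_cost_units(30_000, 500, calls=10, cached=False)
cached = input_cost_units(30_000, 500, calls=10, cached=True)
batched = uncached * 0.5        # batch discount, for non-interactive work only

print(round(uncached), round(cached), round(batched))  # 305000 69500 152500
print(f"{1 - cached / uncached:.0%}")                  # 77%
```

The per-call saving on the prefix is 90%, but the session-level saving lands lower because the first call pays the write premium and the fresh tokens are never discounted.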
Sources: Anthropic — Message Batches API, Anthropic — Prompt caching.
Technique 2: Tool Dispatch and Source Capture#
The most consequential thing the harness gives a model is the ability to do things it couldn't do alone. Tools — function calls — turn an LLM from a text generator into an actor.
The dispatch is harness work. The definition of which tools exist, which tools to expose for a given call, and how to wrap them is all outside the prompt.
Deep Research Agent uses an elegant pattern for this: tool wrappers that capture metadata as a side effect of being called. Each researcher agent has access to a tavily_search tool, but the version the agent sees is a closure over a captured_sources list:
    import re

    # `tool` (the agent framework's decorator) and `tavily_search` (the raw
    # search function) are imported elsewhere in the project.

    def _create_source_capturing_tools(captured_sources: list[dict]):
        @tool
        def capturing_tavily_search(query: str, max_results: int = 5) -> dict:
            """Search the web using Tavily API."""
            result = tavily_search(query=query, max_results=max_results)
            for r in result.get("results", []):
                url = r.get("url", "")
                # Record each URL once, the first time any search returns it.
                if url and not any(s["url"] == url for s in captured_sources):
                    domain_match = re.match(r'https?://(?:www\.)?([^/]+)', url)
                    captured_sources.append({
                        "url": url,
                        "title": r.get("title", ""),
                        "domain": domain_match.group(1) if domain_match else url,
                        "snippet": r.get("content", "")[:200],
                        "credibility_score": 50,
                        "credibility_tier": "tier3",
                    })
            return result

        return capturing_tavily_search
Three things to notice.
The agent doesn't know the wrapper exists. From the model's perspective, it called tavily_search. The harness intercepted the call, did the search, and also updated a shared list of every source the agent has touched. The model is doing the research; the harness is doing the bookkeeping.
Source citations come from the harness, not the prompt. Most "agent that cites its sources" implementations try to make the model emit citations in its output. That's prompt engineering, and it's brittle — the model will sometimes hallucinate citations, sometimes drop them, sometimes get the URLs slightly wrong. Capturing sources at the tool layer is bulletproof. The model can't lie about what it didn't call.
The same pattern generalizes to any side-channel telemetry. Want to log every tool call for an audit? Wrap. Want to enforce rate limits per tool per agent? Wrap. Want to swap a real API call for a mock in tests? Wrap. The tool wrapper is one of the highest-leverage patterns in agent design, and it lives entirely in the harness.
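The audit-log variant of the wrapper can be sketched as a framework-agnostic decorator. The names here (`audited`, `lookup_clause`, the log entry fields) are hypothetical; a real agent framework would apply its own tool decorator on top.

```python
import functools
import time
from typing import Callable

def audited(audit_log: list, agent_id: str) -> Callable:
    """Decorator factory: record every call to the wrapped tool in audit_log."""
    def decorator(fn: Callable) -> Callable:
        @functools.wraps(fn)  # preserve the name and docstring of the wrapped tool
        def wrapper(*args, **kwargs):
            started = time.monotonic()
            entry = {"agent": agent_id, "tool": fn.__name__, "kwargs": kwargs, "ok": False}
            audit_log.append(entry)          # logged even if the tool raises
            try:
                result = fn(*args, **kwargs)
                entry["ok"] = True
                return result
            finally:
                entry["seconds"] = time.monotonic() - started
        return wrapper
    return decorator

audit_log: list = []

@audited(audit_log, agent_id="researcher-1")
def lookup_clause(clause_id: str) -> dict:
    """Hypothetical tool: fetch one clause by id."""
    return {"clause_id": clause_id, "text": "Liability is capped at fees paid."}

lookup_clause(clause_id="7.2")
print(audit_log[0]["tool"], audit_log[0]["ok"])  # lookup_clause True
```

Swapping the body of `wrapper` for a rate-limit check or a canned test response gives the other two variants without the model-facing tool changing at all.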
The full Deep Research orchestration post has the rest, including credibility scoring as a separate captured tool. Important point for this series: citation accuracy is a harness concern, not a prompt one.
Where SDKs Fit (and Why Most of This Is Now a Commodity)#
Everything in this post — decoder params, tool dispatch, sub-agent decomposition, defensive parsing, hooks, observability — is what the 2026 agent SDKs implement for you. The techniques are real and durable; the code you write to implement them is mostly not your differentiator anymore.
A rough landscape, mid-2026:
| SDK | Language | Strengths |
|---|---|---|
| Claude Agent SDK (Anthropic) | Python / TypeScript | Exposes the harness Claude Code runs on; first-class MCP client, hooks system, permission modes |
| OpenAI Agents SDK (~26k stars; evolved from Swarm) | Python / TypeScript | Agent / Handoff / Guardrail primitives; formalized sub-agents in April 2026 |
| LangGraph | Python / TypeScript | State-machine graph model; interrupt nodes; deep enterprise adoption (Uber, Klarna, LinkedIn, JPMorgan) |
| Mastra | TypeScript (de facto) | 19k stars, 300k weekly npm downloads; production-grade workflow primitives |
| Pydantic AI | Python | Typed agents, native Pydantic validation, eval-driven from day one |
| CrewAI | Python | Role-based "crew" abstraction; popular for orchestrator-of-specialists patterns |
| AutoGen 0.4+ (Microsoft) | Python | Code execution focus; default sandboxed execution |
| SmolAgents (Hugging Face) | Python | "Code as action" — the agent writes Python instead of emitting tool-call JSON |
The consolidation is real. Where in 2024 you might have rolled your own agent loop, in 2026 the question is "which SDK?" — and the answer depends mostly on language (Python vs. TypeScript), ecosystem (Anthropic vs. OpenAI vs. multi-provider via LiteLLM), and which patterns the SDK makes idiomatic vs. awkward.
Two practical consequences for this post:
1. The techniques don't change; the implementation does. A hook in Claude Code is called PreToolUse. The same primitive in OpenAI Agents SDK is a Guardrail. In Mastra it's a workflow hook. In LangGraph it's an interrupt node. The semantic primitive — "deterministic code that runs before/after a model step and can halt or modify the flow" — is identical across all of them. The mental model in this post is portable; the syntax isn't.
2. The bugs change shape, too. Custom-harness bugs are usually orchestration mistakes (the asyncio.shield story from Technique 3). SDK-harness bugs are usually framework-version mismatches and dependency upgrades. Both require harness-debugging skill, but the symptoms are different.
The Contract Analyzer doesn't use a heavyweight agent SDK because it's a single-shot extraction (no loop). If we extended it to support a multi-turn legal-assistant workflow with cross-clause sub-agent fan-out, the right move would be to pick a Python SDK (Claude Agent SDK or LangGraph, depending on whether we want graph-shaped control flow) rather than continue to hand-roll the loop.
Sources: Anthropic — Building agents with the Claude Agent SDK, OpenAI Agents SDK docs, LangGraph docs, Mastra docs.
MCP: Tool Definition as Protocol#
The single biggest change to harness engineering between 2024 and 2026 is the Model Context Protocol (MCP). It went from "Anthropic's new spec" in late 2024 to 97 million monthly SDK downloads and ~10,000 active public servers by mid-2026, with 41% of surveyed software organizations reporting limited or broad production use. It's been donated to the Linux Foundation's Agentic AI Foundation. It's adopted across Anthropic, OpenAI, Google DeepMind, Microsoft, Salesforce, Cloudflare, Replit, and most other major model and product platforms.
What MCP is, in one sentence: an open protocol for exposing tools, data, and prompts to LLMs over a standard interface — so a tool implementation written once can be called by any agent in any SDK on any model.
For harness engineering, three things change:
1. Tool definitions become a portable artifact. Before MCP, every agent framework had its own tool-definition format. With MCP, you ship an mcp-server-foo (in Python, TypeScript, or any language with a binding) and any compliant client — Claude Code, Cursor, OpenAI Agents SDK, LangGraph, Pydantic AI, custom code — can call it. Microsoft's Playwright MCP server is a canonical example: one server gives every agent in the ecosystem browser-automation tools, using accessibility-tree representations that are ~4x more token-efficient than screenshot-based computer use.
2. Tool-set size is now a context-budget decision. A 30-tool MCP server can blow 8-15k tokens before the user message even appears — every tool's name, description, and schema lives in the system block on every call. Loading all available MCP servers all the time is the equivalent of importing every Python package on every script invocation: technically possible, ruinously expensive. The 2026 pattern is just-in-time tool loading — connect MCP servers per-task, not per-session.
3. The security model is non-trivial. There are two transports: stdio for local subprocesses and Streamable HTTP for remote servers (Streamable HTTP replaced the older HTTP+SSE transport in the 2025-03-26 spec). Streamable HTTP servers need Origin header validation (DNS rebinding protection), servers running locally should bind to localhost only, and remote servers authorize with OAuth 2.1 plus PKCE and resource indicators. Token passthrough is explicitly forbidden ("confused deputy" attack). If you're operating MCP servers in production, the security checklist matters more than the protocol details.
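Point 2, just-in-time tool loading, can be sketched without any MCP library. Everything here is an illustrative assumption: the `tags` field is not part of an MCP tool definition, the catalog is hand-written, and the 4-characters-per-token estimate is a crude heuristic rather than a tokenizer.

```python
import json

def estimate_tokens(tool_def: dict) -> int:
    """Very rough size estimate: about 4 characters per token (heuristic only)."""
    return len(json.dumps(tool_def)) // 4

def select_tools(catalog: dict, task_tags: set, token_budget: int) -> list:
    """Pick only the tools tagged for this task, smallest first, within the budget."""
    relevant = [t for t in catalog.values() if task_tags & set(t["tags"])]
    relevant.sort(key=estimate_tokens)
    chosen, used = [], 0
    for tool_def in relevant:
        cost = estimate_tokens(tool_def)
        if used + cost > token_budget:
            break                      # the next tool would blow the context budget
        chosen.append(tool_def["name"])
        used += cost
    return chosen

catalog = {
    "extract_clauses": {"name": "extract_clauses", "tags": ["contract"],
                        "description": "Extract clauses from a contract.",
                        "input_schema": {"type": "object", "properties": {"text": {"type": "string"}}}},
    "score_risk": {"name": "score_risk", "tags": ["contract"],
                   "description": "Score the risk of one clause.",
                   "input_schema": {"type": "object", "properties": {"clause": {"type": "string"}}}},
    "browser_click": {"name": "browser_click", "tags": ["browser"],
                      "description": "Click an element on the page.",
                      "input_schema": {"type": "object", "properties": {"ref": {"type": "string"}}}},
}

loaded = select_tools(catalog, task_tags={"contract"}, token_budget=2_000)
print(sorted(loaded))  # ['extract_clauses', 'score_risk']
```

A contract task never pays for the browser tool's schema, and the budget parameter makes the context cost of the tool set an explicit number instead of a side effect of whichever servers happen to be connected.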
For Contract Analyzer, the natural MCP wrap would be: a contract-analyzer-mcp server exposing extract_clauses, score_risk, summarize_contract, and compare_clauses tools. Any agent in any framework could then use Contract Analyzer as a tool — including, eventually, a legal-research super-agent that composes contract analysis with case-law search and statutory lookup, each provisioned via its own MCP server.
Sources: Model Context Protocol — Spec, MCP Release Candidate notes, MCP downloads/adoption data.
Technique 3: Sub-Agent Decomposition#
The single most powerful harness technique is breaking a hard problem into smaller problems run in parallel. This is what fixes the "multi-clause reasoning" failure mode the Contract Analyzer extraction had.
Deep Research Agent's pattern: one orchestrator decomposes the query into N sub-queries, fans them out to N parallel researchers, then a critique agent reviews and a synthesizer consolidates.
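    # This runs inside the async generator that streams events to the client.
    research_tasks = [
        _research_sub_query(sq, i) for i, sq in enumerate(sub_queries)
    ]
    gather_future = asyncio.gather(*research_tasks)

    while not gather_future.done():
        try:
            # shield() keeps a wait_for timeout from cancelling the gather itself.
            research_results = await asyncio.wait_for(
                asyncio.shield(gather_future), timeout=10.0
            )
            break
        except asyncio.TimeoutError:
            yield {"type": "keepalive"}
    else:
        # The gather finished while the generator was suspended at the yield.
        research_results = gather_future.result()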
Two harness details in this code are easy to miss.
asyncio.shield is load-bearing. Without it, asyncio.wait_for cancels the gather on timeout — meaning your keepalive loop would also kill the underlying research tasks. With shield, only the wait breaks; the research tasks keep running. This is the difference between "I sent a keepalive" and "I sent a keepalive and accidentally cancelled all my work."
Keepalives are a harness response to a deployment constraint. CloudFront has a 60-second idle timeout. If a long-running research pipeline goes silent for >60s, the connection drops and the user sees an error even though the work is succeeding. The harness emits {"type": "keepalive"} events every 10s during the wait — the prompt has no idea this is happening, and shouldn't.
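Both details can be reproduced in a self-contained script. This is a scaled-down sketch, not the Deep Research code: `slow_researcher` is a hypothetical stand-in for a sub-agent, the timings are in fractions of a second, and the loop is simplified to `while True`.

```python
import asyncio
from typing import AsyncIterator

async def slow_researcher(i: int, delay: float) -> str:
    """Hypothetical stand-in for one researcher sub-agent."""
    await asyncio.sleep(delay)
    return f"findings-{i}"

async def research_with_keepalives(delays: list, keepalive_every: float) -> AsyncIterator[dict]:
    gather_future = asyncio.gather(
        *(slow_researcher(i, d) for i, d in enumerate(delays))
    )
    while True:
        try:
            # On timeout only the shield wrapper is cancelled; the gather keeps running.
            results = await asyncio.wait_for(
                asyncio.shield(gather_future), timeout=keepalive_every
            )
            break
        except asyncio.TimeoutError:
            yield {"type": "keepalive"}
    yield {"type": "results", "data": results}

async def collect(delays: list, keepalive_every: float) -> list:
    return [event async for event in research_with_keepalives(delays, keepalive_every)]

events = asyncio.run(collect([0.2, 0.3], keepalive_every=0.05))
print(events[-1])  # {'type': 'results', 'data': ['findings-0', 'findings-1']}
```

Removing the `asyncio.shield` call is a useful experiment: the first timeout cancels the gather, and the next `wait_for` raises `CancelledError` instead of ever delivering results.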
Sub-agent decomposition is also where the multi-clause reasoning fix lives for Contract Analyzer. The pattern: take the extracted clauses, identify cross-clause dependencies (liability cap depends on indemnification, termination depends on notice period), and dispatch a second-pass agent per dependency pair with the relevant clauses in context. The first-pass agent doesn't know any of this exists; it just extracts. The second-pass agent doesn't see the full contract; it just reasons over the pair. The harness orchestrates between them.
The timing on Deep Research: ~20 seconds total across 7 agent calls (1 decomposition + 4 parallel researchers + 1 critique + 1 synthesis). If those four researchers ran sequentially instead of in parallel, total wall time would be ~50 seconds. That parallelism is a harness decision. No prompt changed.
Four topologies, not one#
The Deep Research pattern is one of four canonical sub-agent topologies that the 2026 SDK ecosystem has converged on. Picking the right one is a harness-design decision that maps to the shape of your problem:
| Topology | Shape | Best for | Implementations |
|---|---|---|---|
| Orchestrator-worker (supervisor) | One orchestrator dispatches N parallel workers, then a synthesizer consolidates | Decomposable research, parallel data gathering, fan-out/fan-in | Claude Agent SDK, Google ADK, what Deep Research uses |
| Swarm / handoffs | Control transfers permanently between agents based on the current task | Triage-then-specialist routing (e.g., support → billing → engineering) | OpenAI Agents SDK's Handoff primitive |
| Hierarchical | Multi-level orchestrators (orchestrators-of-orchestrators) | Deep research, large codebases, anything with natural taxonomic structure | Custom on top of any SDK |
| Network / peer-to-peer | Agents discover and delegate to each other via a shared protocol | Cross-org agent collaboration, multi-tenant systems | A2A protocol (Google, formerly ACP — merged into A2A under the Linux Foundation in Sept 2025) |
The number that justifies the rest of this post: Anthropic's write-up of its multi-agent research system reports that the orchestrator-worker setup outperformed a single Claude Opus 4 agent by 90.2% on internal research evals, with multi-agent runs using roughly 15x the tokens of a chat interaction. That 15x cost number is exactly why every guardrail in Technique 5 (step ceiling, cost ceiling, no-progress detector) exists. Sub-agent decomposition unlocks capabilities that single agents can't reach; it also makes cost runaway the dominant failure mode if you don't put deterministic stops in place.
For Contract Analyzer's multi-clause reasoning extension, the natural choice is orchestrator-worker: a coordinating agent identifies clause-dependency pairs, dispatches a worker per pair, and consolidates the results. Each worker sees only its pair (small context, fast call); the orchestrator sees only the structure (small context, cheap call); cost stays bounded.
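A minimal sketch of that orchestrator-worker shape, with the model calls stubbed out. The dependency map, the clause dictionaries, and every function name are hypothetical; a real worker would call the model with only its two clauses in context.

```python
import asyncio

# Hypothetical dependency map: clause types that should be reasoned about together.
DEPENDENT_TYPES = [("liability_cap", "indemnification"), ("termination", "notice_period")]

def find_pairs(clauses: list) -> list:
    """Orchestrator step: pair up extracted clauses whose types depend on each other."""
    by_type = {c["type"]: c for c in clauses if c.get("found")}
    return [(by_type[a], by_type[b]) for a, b in DEPENDENT_TYPES
            if a in by_type and b in by_type]

async def analyze_pair(pair: tuple) -> dict:
    """Worker step: sees only its own pair, never the full contract."""
    first, second = pair
    await asyncio.sleep(0)  # stand-in for the model call
    return {"pair": (first["type"], second["type"]), "finding": "stub"}

async def cross_clause_pass(clauses: list) -> list:
    pairs = find_pairs(clauses)
    return await asyncio.gather(*(analyze_pair(p) for p in pairs))  # fan out, fan in

clauses = [
    {"type": "liability_cap", "found": True},
    {"type": "indemnification", "found": True},
    {"type": "termination", "found": True},
    {"type": "notice_period", "found": False},  # missing clause: no pair dispatched
]
findings = asyncio.run(cross_clause_pass(clauses))
print(findings)  # [{'pair': ('liability_cap', 'indemnification'), 'finding': 'stub'}]
```

The number of worker calls is bounded by the dependency map rather than by anything the model decides, which is what keeps the cost of the second pass predictable.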
Technique 4: Defensive Parsing and Validation#
The five-line parser from Post 1 and Post 2 is harness code. It's worth re-stating because it's the simplest possible illustration of "harness does what prompt can't":
    import json

    # Keep only the outermost [...] span, discarding any preamble or trailing chatter.
    start = response_text.find("[")
    end = response_text.rfind("]") + 1
    if start == -1 or end == 0:
        raise ValueError("No JSON array found in response")
    json_str = response_text[start:end]
    clauses = json.loads(json_str)
I rewrote the extraction prompt four times trying to suppress "Sure! Here's the analysis:" preambles. Each rewrite ticked success rate up by ~0.3% and plateaued. The five lines above closed the gap completely.
Generalizing this: the 2026 best practice is a three-layer reliability stack for structured outputs. Each layer catches a different failure mode; running all three is the difference between "demo works" and "production survives."
L1 — Type validation at the boundary. Pydantic, dataclass post-init checks, JSON schema validation. If the model promises risk_level is one of four enum values, validate it. If it returns "moderate," fail fast. This is what the Contract Analyzer's _fallback_clauses() is doing reactively — and what a Pydantic model on the response shape would do proactively.
L2 — Retry-with-error-feedback. When L1 fails, the harness retries the call with the model's own validation error appended to the prompt. The model sees its mistake and self-corrects. Cap at N retries (typically 3). Instructor automates this across 15+ providers — 3M+ monthly downloads — and is the de facto Python library for this layer. The Contract Analyzer's "retry with a stricter prompt" pattern is L2 done manually.
L3 — Constrained decoding at the source. The model is constrained at the sampling layer to only emit tokens that match the schema. Native support in Anthropic (tool_choice forcing a specific tool's schema), OpenAI (response_format), Google. Open-source: Outlines, XGrammar (default in vLLM, SGLang, TensorRT-LLM as of March 2026, achieves <40µs/token overhead — essentially free). The structural-extraction pattern (find("[") / rfind("]")) is a poor man's L3 — recovering valid JSON from invalid surrounding text rather than preventing the invalid text in the first place.
A production case study: 8% unparseable + 5% wrong-types baseline → 0.3% unparseable, 0% wrong-types with the full L1+L2+L3 stack in place.
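L1 and L2 can be sketched with the standard library alone, which makes the control flow easier to see than it is inside Pydantic or Instructor. The enum values, the stub model, and the function names are assumptions for illustration; this is not the Contract Analyzer's code.

```python
import json
from typing import Callable

VALID_RISK = {"low", "medium", "high", "critical"}  # hypothetical enum values

def validate_clauses(raw: str) -> list:
    """L1: parse and type-check at the boundary; raise ValueError on any violation."""
    data = json.loads(raw)  # json.JSONDecodeError is a ValueError subclass
    if not isinstance(data, list):
        raise ValueError("expected a JSON array of clauses")
    for i, clause in enumerate(data):
        if not isinstance(clause, dict):
            raise ValueError(f"clause {i}: expected an object")
        if clause.get("risk_level") not in VALID_RISK:
            raise ValueError(
                f"clause {i}: risk_level {clause.get('risk_level')!r} not in {sorted(VALID_RISK)}"
            )
    return data

def extract_with_retry(call_model: Callable[[str], str], prompt: str,
                       max_retries: int = 3) -> list:
    """L2: on validation failure, retry with the validation error appended."""
    attempt_prompt = prompt
    for _ in range(max_retries + 1):
        try:
            return validate_clauses(call_model(attempt_prompt))
        except ValueError as err:
            attempt_prompt = (f"{prompt}\n\nYour previous output was rejected: {err}\n"
                              "Return corrected JSON only.")
    raise ValueError(f"validation still failing after {max_retries} retries")

# Hypothetical stub model: wrong enum first, corrected once it sees the feedback.
def stub_model(prompt: str) -> str:
    level = "high" if "rejected" in prompt else "moderate"
    return json.dumps([{"clause": "indemnification", "risk_level": level}])

clauses = extract_with_retry(stub_model, "Extract clauses as a JSON array.")
print(clauses)  # [{'clause': 'indemnification', 'risk_level': 'high'}]
```

The harness owns the retry count and the decision to retry at all; the only prompt-layer piece is the wording of the feedback message.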
Three rules for stacking the layers:
Run all three in production, not one. L3 alone misses semantic validation errors (the schema is satisfied but the values are nonsense). L1 alone gives up on recoverable failures. L2 alone burns retries on failures L3 would have prevented. The cost of running all three is low; the cost of running only one is the 5% production tail that won't go away.
Don't write code for failures that can't happen. If L3 guarantees valid JSON, don't write a defensive parser for malformed JSON. If L1 enforces enums at the boundary, don't write downstream code that handles "moderate" alongside the four valid values. Stale defensive code rots faster than load-bearing code.
Make the fallback degraded, not absent. When all three layers fail (which they will, eventually), return a structured fallback rather than an exception. The Contract Analyzer's _fallback_clauses() returns 41 clauses all marked found: False with "Unable to analyze" — the user sees the analysis failed, but the page still renders. A 500 error renders nothing.
Defensive parsing is unsexy. It's also what makes the difference between a demo that works on the happy path and a system that survives in production. The prompt can't reach down to do this work.
Sources: Instructor docs, Outlines, XGrammar, Anthropic — Tool use forcing.
Technique 5: Hooks and Lifecycle Events#
A common agent failure mode is: the loop never stops. The agent calls a tool, processes the result, decides it needs another tool, calls that, processes that result, and keeps going forever — racking up tokens and minutes. The naive fix is to put "stop when you have enough information" in the system prompt. That fix doesn't work, because the model's judgment of "enough" is exactly what's broken.
The right fix is a hook — deterministic code that runs after every step and decides whether to continue:
    def step_hook(step_count, accumulated_cost, accumulated_sources):
        if step_count >= MAX_STEPS:
            return STOP
        if accumulated_cost >= COST_CEILING:
            return STOP
        if len(accumulated_sources) >= MIN_SOURCES_FOR_SYNTHESIS:
            return SYNTHESIZE  # We have enough; force the next step
        return CONTINUE
The model is never asked. The harness decides. That's the whole point.
This is the same general pattern that makes the Critique Agent in Deep Research work. The critique agent is a separate model call after the parallel researchers complete:
You are a Research Critique Agent. Review research findings for quality
and completeness.
Evaluate:
1. Coverage
2. Accuracy
3. Recency
4. Bias
5. Gaps
Output: PASS / REFINE / FAIL
The critique agent emits one of three deterministic verbs. The harness reads that verb and decides what to do next: synthesize on PASS, dispatch a second-pass researcher on REFINE, abort on FAIL. The model produces an opinion; the harness turns the opinion into an action. Separating those two responsibilities is what makes the agent reliable.
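The verb-to-action mapping is a few lines of harness code. This sketch assumes the verdict appears as a bare uppercase word in the critique text; treating a missing or contradictory verdict as an abort is my conservative choice here, not something taken from the Deep Research code.

```python
import re

ACTIONS = {"PASS": "synthesize", "REFINE": "dispatch_second_pass", "FAIL": "abort"}

def next_action(critique_text: str) -> str:
    """Map the critique agent's verdict to a harness action."""
    verdicts = set(re.findall(r"\b(PASS|REFINE|FAIL)\b", critique_text))
    if len(verdicts) != 1:
        return "abort"   # no verdict, or conflicting verdicts: do not guess
    return ACTIONS[verdicts.pop()]

print(next_action("Coverage is thin on recency.\nOutput: REFINE"))  # dispatch_second_pass
print(next_action("Looks fine to me."))                             # abort
```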
Two operational rules I now follow for every agent:
- Set a step ceiling. No matter what, the agent loop terminates after N iterations. Pick N to be 2-3x what a healthy run requires.
- Set a cost ceiling. No matter what, the agent stops if it has burned $X in tokens. Cost runaway is the worst failure mode because it's silent until the invoice arrives.
Both are harness, not prompt. The model can ignore "please don't loop forever." It cannot ignore a Python break statement.
The 2026 vocabulary: lifecycle events#
The pattern I described above as "a hook that runs after every step" has been formalized as a named, standardized primitive in modern agent SDKs. The clearest taxonomy comes from Claude Code, but the primitive itself is portable — OpenAI Agents SDK calls them Guardrails, Mastra calls them workflow hooks, LangGraph calls them interrupt nodes. Same idea, different syntax.
The Claude Code lifecycle events:
| Event | When it fires | Typical use |
|---|---|---|
| UserPromptSubmit | User submits a new prompt | Input validation, redaction, intent classification |
| PreToolUse | Before a tool is dispatched | Permission checks, cost gates, argument validation; can return approve or deny |
| PostToolUse | After a tool returns | Output filtering, side-effect logging, metric recording |
| PreCompact | Before context compaction fires | Snapshot state to external file before in-context summary replaces it |
| Stop | Main agent's loop ends | Cleanup, final logging, billing |
| SubagentStop | A sub-agent finishes | Aggregate sub-agent metrics, cleanup per-sub-agent state |
| Notification | Agent needs human attention | Slack/email alerts, paging |
The architectural shift the names encode: deterministic policy moves out of the prompt ("please don't call the dangerous tool") and into the harness ("PreToolUse returns deny if tool == dangerous_tool"). A PreToolUse hook that returns deny is more powerful than the post's earlier step_hook example: it blocks the tool call before it executes, rather than detecting a problem after the fact.
The lifecycle vocabulary also points at the boundary the post is built on. Notice that no Claude Code hook is named PreMessageContent or EnforcePromptStyle. The hook system can't reach into what the model says — only into what happens around the model's actions. That's the harness/prompt boundary, made operational.
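The "block before it executes" behaviour can be shown with a generic pre-tool gate. This is not the Claude Code hook API, which has its own configuration format; `run_tool`, `deny_dangerous`, and the decision dictionaries are hypothetical names for the same idea.

```python
def run_tool(name: str, args: dict, tools: dict, pre_hooks: list) -> dict:
    """Run every pre-tool hook; the first 'deny' blocks the call before it executes."""
    for hook in pre_hooks:
        decision = hook(name, args)
        if decision["decision"] == "deny":
            return {"blocked": True, "reason": decision["reason"]}
    return {"blocked": False, "result": tools[name](**args)}

def deny_dangerous(name: str, args: dict) -> dict:
    """Hypothetical policy hook: a static deny-list."""
    if name in {"delete_file"}:
        return {"decision": "deny", "reason": f"{name} is on the deny-list"}
    return {"decision": "approve", "reason": ""}

executed: list = []
tools = {
    "delete_file": lambda path: executed.append(path),
    "read_file": lambda path: f"contents of {path}",
}

blocked = run_tool("delete_file", {"path": "/tmp/x"}, tools, [deny_dangerous])
allowed = run_tool("read_file", {"path": "/tmp/x"}, tools, [deny_dangerous])
print(blocked["blocked"], executed)  # True []
print(allowed["result"])             # contents of /tmp/x
```

The empty `executed` list is the point: the denied tool never ran, so there is nothing to detect or roll back afterwards.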
Sources: Claude Code — Hooks reference, Anthropic Agent SDK — Hooks.
Cost-aware routing and budget enforcement#
The cost ceiling above is a static stop condition: "halt at $X." The 2026 extension is dynamic cost-aware routing — at >90% budget utilization, route the next call to a cheaper model rather than aborting outright. AI gateways (LiteLLM, Bifrost, Helicone) implement this as a proxy layer between the agent and the model providers: the agent code stays unchanged, the routing policy is configured externally.
The headline numbers: prompt caching (covered in Technique 1) gives 45-80% spend reduction on stable-prefix workloads; semantic caching + budget routing gives another ~47% on top of that. Both are pure harness work; neither requires touching the agent logic.
For the Contract Analyzer, the realistic version is: route the extraction call to Claude Haiku 4.5 (the current production choice — fast, cheap, high enough quality) but allow fallback to Sonnet 4.6 on validation failure or when the user explicitly requests "deep analysis." That decision is harness configuration, not agent code.
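As routing logic, that policy might look like the sketch below. The model ids are placeholders, the 90% threshold is the one mentioned above, and in practice this decision would usually live in gateway configuration rather than in a Python function.

```python
def pick_model(spent: float, budget: float, validation_failed: bool = False,
               deep_analysis: bool = False, cheap: str = "cheap-model",
               strong: str = "strong-model", downgrade_at: float = 0.9) -> str:
    """Return the model id for the next call, or raise once the budget is gone."""
    if spent >= budget:
        raise RuntimeError("budget exhausted")   # the static cost ceiling
    if spent / budget > downgrade_at:
        return cheap                             # near the ceiling: never escalate
    if validation_failed or deep_analysis:
        return strong                            # escalate only while there is room
    return cheap

print(pick_model(spent=1.0, budget=10.0))                          # cheap-model
print(pick_model(spent=1.0, budget=10.0, validation_failed=True))  # strong-model
print(pick_model(spent=9.5, budget=10.0, deep_analysis=True))      # cheap-model
```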
Permissions: The Model Proposes, the Harness Disposes#
If hooks are the event primitive, permissions are the policy primitive. Every serious 2026 agent harness ships with permission modes — declarative rules about what tools the agent can call without confirmation, what it can call with confirmation, and what it can't call at all.
Claude Code's five modes are the clearest worked example:
| Mode | Behavior | When to use |
|---|---|---|
| default | Prompt for confirmation on first use of every tool; remember subsequent approvals per-session | Interactive development |
| acceptEdits | Auto-approve file edits; still prompt for Bash, WebFetch, MCP tools | Trusted editing workflows |
| plan | Read-only; agent can analyze but not modify anything | Reviewing what an agent would do before letting it run |
| dontAsk (allow-list) | Pre-approve specific tools/patterns via settings; deny everything else | Production agents with tight tool scopes |
| bypassPermissions | Auto-approve everything | CI/CD, sandboxed dev containers; never on a real machine |
Two distinctions worth marking:
Permissions ≠ sandboxing. Permissions gate which tools the harness lets the model call. Sandboxing constrains what those tools can do at the OS level — filesystem allow-lists for Read/Edit, network restrictions for Bash, container isolation for code execution. Both are harness concerns; they fail in different ways. A permission misconfiguration lets the model invoke a dangerous tool; a sandboxing misconfiguration lets a properly-permitted tool do dangerous things.
The principle has a name now. "The model proposes, the harness disposes" is exactly the framing the 2026 literature uses. The model is the suggestion engine — it proposes a tool call, an action, a next step. The harness is the enforcement engine — it decides what to actually let happen. This split is what makes agents safe enough to ship, and it's what every permission mode and lifecycle hook in this section is operationalizing.
For Contract Analyzer in a multi-tool legal-research extension, the realistic permission shape: pre-approved extract_clauses, score_risk, summarize_contract (read-only on the uploaded document); always-prompt for email_lawyer (side effect); always-deny for modify_contract (out of scope for this product).
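That permission shape can be sketched as a small policy table. The tool names come from the paragraph above; the `Permission` enum, the `is_permitted` helper, and the deny-by-default rule for unlisted tools are assumptions of this example, not Claude Code's configuration format.

```python
from enum import Enum
from typing import Callable


class Permission(Enum):
    ALLOW = "allow"
    ASK = "ask"
    DENY = "deny"


# Policy table taken from the text above.
POLICY = {
    "extract_clauses": Permission.ALLOW,
    "score_risk": Permission.ALLOW,
    "summarize_contract": Permission.ALLOW,
    "email_lawyer": Permission.ASK,      # side effect: always confirm
    "modify_contract": Permission.DENY,  # out of scope for the product
}


def is_permitted(tool_name: str, confirm: Callable[[str], bool]) -> bool:
    """Return True only if policy (plus confirmation, when required) allows the call."""
    # Unlisted tools fall through to DENY (an assumption of this sketch).
    rule = POLICY.get(tool_name, Permission.DENY)
    if rule is Permission.ALLOW:
        return True
    if rule is Permission.ASK:
        return confirm(tool_name)
    return False
```

The `confirm` callback is where a UI prompt would plug in; in a headless deployment it could simply return `False`.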
Sources: Claude Code — Configure permissions, Anthropic — Claude Code auto mode.
When the Loop Takes Hours: Durable Execution#
The Deep Research keepalive pattern (Technique 3) handles agent runs that last 30-90 seconds. It does not survive multi-hour or multi-day agents — the kind that run overnight data pipelines, process long batches, or wait for human approval mid-loop.
The 2026 answer is durable execution: the harness is built on a workflow engine that checkpoints state to durable storage between steps, so the agent can resume from where it left off after a crash, a restart, or a multi-hour pause.
The three implementations that matter:
- Temporal — the incumbent. Workflows as code, deterministic replay, signal-and-wait primitives. At Replay 2026, Temporal announced serverless Workers on AWS Lambda explicitly targeting the agentic-AI use case.
- Inngest — TypeScript-first, event-driven. Strong integration with Mastra.
- AWS Lambda Durable Functions — announced 2026, supports runs up to one year with checkpoint-and-replay. The native AWS option for agent workloads on AgentCore.
The mapping to agent concepts is clean, which is why this is the right shape:
- Every tool call → a workflow activity (checkpointed result)
- Every retry → a workflow saga step (with automatic compensating actions)
- Every human-in-the-loop pause → `await wait_for_event("user_response")`
- Every cost ceiling breach → a saga compensating action that rolls back resources the agent provisioned
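The checkpoint-and-resume idea behind the first mapping can be sketched with nothing but the standard library. This is not Temporal's, Inngest's, or Lambda's API; `run_durable` and the JSON checkpoint file are hypothetical stand-ins, and real engines add deterministic replay, timers, and signals on top.

```python
import json
from pathlib import Path
from typing import Any, Callable

Step = tuple[str, Callable[[dict[str, Any]], Any]]


def run_durable(steps: list[Step], checkpoint_path: Path) -> dict[str, Any]:
    """Run named steps in order, persisting each result so a restart skips finished work."""
    state: dict[str, Any] = {}
    if checkpoint_path.exists():
        state = json.loads(checkpoint_path.read_text())
    for name, step in steps:
        if name in state:
            continue  # already checkpointed: reuse the stored result
        state[name] = step(state)  # results must be JSON-serializable
        checkpoint_path.write_text(json.dumps(state))
    return state
```

If a step raises, everything before it is already on disk, so calling `run_durable` again with the same checkpoint path resumes at the failed step rather than at the beginning.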
This is the gap between "agent that does a research query" (the Deep Research worked example) and "agent that runs a six-hour ETL with three human approvals." For Contract Analyzer, durable execution would matter if we built a contract negotiation agent that drafts revisions, sends them to the counterparty, waits days for a response, and resumes when one arrives.
Sources: Anthropic — Effective harnesses for long-running agents, AWS — AgentCore and Temporal, Lambda Durable Functions overview.
Skills: Capability Packaging#
A new harness primitive in 2026, distinct from tools and from prompts: Anthropic's Skills. The public Skills Marketplace launched on May 1, 2026 with ~600 skills.
A Skill is a directory containing a SKILL.md file (YAML frontmatter: name + description), optional scripts, and optional resource files. The killer feature is progressive disclosure:
- Initially, only the ~1,024-character description loads into the model's context
- When the model decides the skill is relevant, the full `SKILL.md` instructions load
- When a specific subtask needs them, nested resources load
That's a different shape from tools (functions the model calls) and a different shape from prompts (always-loaded text). A Skill is a capability bundle the model can discover, evaluate, and lazy-load on demand. The context cost of having 50 skills available is the cost of 50 descriptions (~50KB of text at ~1,024 characters each). The cost of having 50 tools is the cost of 50 full schemas (~150KB+).
For Contract Analyzer extended into a legal-tech super-agent, the right shape is probably to package the contract-analysis pipeline as a Skill (with extract_clauses.py, score_risk.py, the CUAD taxonomy as a resource file, and a SKILL.md explaining when to invoke each step) rather than as a flat list of tools. The agent loads the description always; it loads the full skill only when a contract needs analyzing.
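Progressive disclosure can be illustrated with a small lazy loader. This is a sketch of the loading pattern only, not Anthropic's implementation: the `Skill` class and `context_preamble` are hypothetical, and the example reads `SKILL.md` as plain text without parsing its YAML frontmatter.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Optional


@dataclass
class Skill:
    name: str
    description: str  # short text that is always in context
    path: Path        # directory that holds SKILL.md and resources
    _body: Optional[str] = field(default=None, repr=False)

    def instructions(self) -> str:
        """Read SKILL.md only on first use, then cache it."""
        if self._body is None:
            self._body = (self.path / "SKILL.md").read_text()
        return self._body


def context_preamble(skills: list[Skill]) -> str:
    """What the model sees up front: one line per skill, no bodies."""
    return "\n".join(f"- {s.name}: {s.description}" for s in skills)
```

Only `context_preamble` contributes to the always-loaded context; `instructions()` is called once the model has selected the skill.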
This is the newest pattern in this post, and it's the one most likely to evolve in the next twelve months. Worth tracking.
Sources: Anthropic — Equipping agents with Agent Skills, Agent Skills overview.
Technique 6: Tracing — The Harness Has to Be Observable#
The final harness technique is the one that makes everything else debuggable: tracing.
A multi-agent system where one of four parallel researchers silently produces low-quality results is invisible without tracing. The aggregate output looks fine. The cost looks normal. Only by inspecting the individual sub-agent's spans do you see that researcher_2 took 4.5 seconds for a sub-query that should have taken 1.5, used 4x the expected tokens, and returned three sentences of generic prose. Tracing surfaces that.
The Agent Observability project (full post) builds this on OpenTelemetry with a hierarchical span structure that mirrors the agent loop:
research_pipeline (12,400ms)
├── decomposition (800ms)
│ └── bedrock_call (780ms)
│ gen_ai.request.model: claude-haiku-4.5
│ gen_ai.usage.input_tokens: 420
├── researcher_1 (4,200ms)
│ ├── tavily_search (1,100ms)
│ ├── tavily_search (900ms)
│ └── bedrock_call (1,150ms)
│ gen_ai.usage.input_tokens: 2,800
├── researcher_2 (3,800ms)
├── researcher_3 (3,600ms)
├── researcher_4 (4,100ms)
├── critique (1,100ms)
└── synthesis (2,100ms)
└── bedrock_call (2,050ms)
gen_ai.usage.input_tokens: 4,200
Three properties of this trace that matter for harness debugging:
The hierarchy mirrors the harness structure, not the prompt. Sub-agent dispatch creates parent-child spans. Tool calls create grandchild spans. The shape of the trace tells you the shape of the harness — which is exactly what you need when something's wrong.
Span attributes use the gen_ai.* convention. These are the standard OpenTelemetry GenAI semantic conventions for LLM calls (gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens). Client spans exited experimental status in early 2026; agent and framework spans remain experimental but stable in practice. Every observability tool that speaks OTel (Datadog, Honeycomb, New Relic, Langfuse, Phoenix) understands these. LangChain, CrewAI, AutoGen, and most modern SDKs emit OTel-compliant spans natively. Don't invent your own attribute names; the convention is the win.
Aggregate stats over recent traces drive the dashboard. p50 and p95 latency, average tokens, total cost, average quality score. These are the harness-level metrics that tell you whether your system is healthy in production. Per-trace inspection tells you what's wrong with a specific run; aggregate stats tell you whether the system is drifting.
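As a small illustration of the aggregate side, the latency percentiles can be computed with the standard library alone. `summarize_latencies` is a hypothetical helper; a production dashboard would more likely compute these in the observability backend.

```python
from statistics import mean, quantiles


def summarize_latencies(latencies_ms: list[float]) -> dict[str, float]:
    """p50, p95, and mean over recent trace durations (needs at least two samples)."""
    # quantiles(n=100) returns the 99 cut points p1..p99.
    cuts = quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "mean": mean(latencies_ms)}
```

The same function applies unchanged to token counts or per-run cost; only the input list differs.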
The general rule: instrument the harness, not just the model call. If you can only see "the model was called and returned in 4.2 seconds," you can't tell whether the 4.2 seconds was network, tool dispatch, or actual generation. Hierarchical spans are how you tell.
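The nesting itself is simple to picture. The sketch below is a toy stand-in written against the standard library, not the OpenTelemetry SDK: `Tracer` and `Span` here are hypothetical classes that only mimic how parent-child spans follow the call structure. The attribute key is the real gen_ai.* name from the trace above.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Any, Iterator, Optional


@dataclass
class Span:
    name: str
    attributes: dict[str, Any] = field(default_factory=dict)
    children: list["Span"] = field(default_factory=list)
    duration_ms: float = 0.0


class Tracer:
    """Toy tracer: nests spans with a stack, the way the harness nests calls."""

    def __init__(self) -> None:
        self.root: Optional[Span] = None
        self._stack: list[Span] = []

    @contextmanager
    def span(self, name: str) -> Iterator[Span]:
        current = Span(name)
        if self._stack:
            self._stack[-1].children.append(current)  # child of the active span
        else:
            self.root = current
        self._stack.append(current)
        start = time.perf_counter()
        try:
            yield current
        finally:
            current.duration_ms = (time.perf_counter() - start) * 1000
            self._stack.pop()


tracer = Tracer()
with tracer.span("research_pipeline"):
    with tracer.span("researcher_1"):
        with tracer.span("bedrock_call") as call:
            call.attributes["gen_ai.usage.input_tokens"] = 2800
```

Because each span times its own block, a slow `researcher_1` with a fast `bedrock_call` child points at tool dispatch rather than generation.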
Evaluation as a harness concern: LLM-as-judge#
The Critique Agent in Technique 3 already showed the pattern, but it's worth naming explicitly: it's an LLM-as-judge. A separate model call evaluates a previous model call's output along defined dimensions (coverage, accuracy, recency, bias, gaps), returning a structured verdict the harness acts on.
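A sketch of the harness side of that pattern: parse the judge's structured verdict, then let code decide the next step. The five dimensions come from the text; the 1-5 scale, the threshold of 3, and the action names `synthesize` and `re_research` are assumptions made for illustration. The judge call itself is left out.

```python
import json
from dataclasses import dataclass

DIMENSIONS = ("coverage", "accuracy", "recency", "bias", "gaps")


@dataclass
class Verdict:
    scores: dict[str, int]  # assumed 1-5 scale per dimension

    def passed(self, threshold: int = 3) -> bool:
        return all(self.scores[d] >= threshold for d in DIMENSIONS)


def parse_verdict(judge_output: str) -> Verdict:
    """Validate the judge's JSON; reject verdicts with missing dimensions."""
    raw = json.loads(judge_output)
    missing = [d for d in DIMENSIONS if d not in raw]
    if missing:
        raise ValueError(f"judge omitted dimensions: {missing}")
    return Verdict({d: int(raw[d]) for d in DIMENSIONS})


def next_action(verdict: Verdict) -> str:
    """The harness, not the model, decides what happens after the verdict."""
    return "synthesize" if verdict.passed() else "re_research"
```

Keeping `next_action` as ordinary code means the gate can be unit-tested and changed without touching the judge prompt.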
The 2026 eval-tooling landscape that productionizes this pattern:
- Braintrust — eval-driven development with CI/CD gates; "quality management system" framing
- Langfuse — open-source, acquired by ClickHouse Jan 2026, mature judge harness
- LangSmith — LangChain-native
- Arize Phoenix — OTel-native open source, drift detection
The architectural point that's load-bearing for this series: evaluation is partially a harness concern (the eval suite runs in code, integrates with CI, gates deployments) and partially outside it (the design of the evals, the ground-truth labels, the judging rubric). Inline LLM-as-judge — Critique-Agent-style — is the harness-resident slice. Periodic regression-suite runs against a held-out eval set, gating production deploys, is the broader CI/CD slice.
Either way: it's not a prompt concern. The prompt asks; the eval verifies. Both are required.
Sources: Latitude — Best LLM Observability Tools 2026, Phoenix — LLM as a Judge.
Same Query, Three Rewrites — All Three Columns Filled#
The full picture. The prompt is unchanged from Post 2. The context is unchanged from Post 3. The harness column is new.
SCENARIO: A lawyer uploads a 60-page MSA. Extract every clause against
the 41-type CUAD taxonomy, score risk 0-100, output JSON. Also: relate
liability cap to indemnification, flag the top 3 risks in prose.
┌─────────────────────┬─────────────────────┬─────────────────────┐
│ PROMPT LAYER │ CONTEXT LAYER │ HARNESS LAYER │
├─────────────────────┼─────────────────────┼─────────────────────┤
│ System: precise │ Input pipeline: │ Invocation: │
│ legal AI, JSON │ - PDF → text via │ - max_tokens=8192 │
│ only │ pdfplumber │ - temperature=0.1 │
│ │ - Truncate at │ (extraction) │
│ User prompt: │ 80,000 chars │ - temperature=0.3 │
│ - 4-level risk │ - Format 41 clause │ (summary) │
│ rubric with │ types as │ - timeout=30s │
│ perspective │ "[Category] Name" │ │
│ - 41-clause │ list │ Orchestration: │
│ taxonomy │ - Inject taxonomy + │ - Extract pass │
│ - Schema: 6 fields, │ contract into │ (1 call) │
│ types & enums │ EXTRACTION_PROMPT │ - Cross-clause │
│ specified │ template │ pass (sub-agents │
│ - IMPORTANT-caps │ │ per dep pair) │
│ reinforcement │ Total context: │ - Summary pass │
│ │ ~20,000 tokens │ (separate call, │
│ Second-call │ (was ~60,000) │ different persona,│
│ system prompt: │ │ temp=0.3) │
│ senior legal │ │ │
│ analyst writing │ │ Output handling: │
│ executive │ │ - Defensive JSON │
│ summaries │ │ parser (find "[" │
│ │ │ rfind "]") │
│ │ │ - Pydantic schema │
│ │ │ validation │
│ │ │ - Fallback clauses │
│ │ │ on parse failure │
│ │ │ │
│ │ │ Observability: │
│ │ │ - OTel spans for │
│ │ │ each pass │
│ │ │ - gen_ai.usage.* │
│ │ │ for token/cost │
│ │ │ │
│ │ │ Streaming: │
│ │ │ - SSE keepalives │
│ │ │ every 10s │
└─────────────────────┴─────────────────────┴─────────────────────┘
OUTCOME: 100% parseable extractions (defensive parser closes the
last 2%). Cross-clause reasoning enabled by second-pass sub-agents.
Executive summary has voice (temp 0.3) without hallucinating
sections (anchored to structured input). Every run traced end-to-end.
Total cost: ~$0.008 per contract. Total latency: ~28s, with
keepalives so the user never sees a stalled UI.
The Contract Analyzer scenario is fully solved — and the solution required all three layers. Each layer fixed what the layers below it couldn't. Each layer left work for the layer above. That's the entire argument of this series in one diagram.
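The output-handling row is the piece that closes the last 2%, so here is a minimal sketch of it. It uses only the standard library as a stand-in for the Pydantic validation step; `REQUIRED_FIELDS` names a hypothetical subset of the six-field schema, and returning an empty list as the fallback is an assumption, since the diagram does not say what the fallback clauses contain.

```python
import json
from typing import Any

# Hypothetical subset of the six-field schema, for illustration only.
REQUIRED_FIELDS = {"clause_type", "risk_score"}


def extract_json_array(raw: str) -> list[Any]:
    """Slice from the first "[" to the last "]" so surrounding chatter is ignored."""
    start, end = raw.find("["), raw.rfind("]")
    if start == -1 or end <= start:
        raise ValueError("no JSON array found in model output")
    return json.loads(raw[start : end + 1])


def parse_clauses(raw: str) -> list[dict[str, Any]]:
    """Return validated clauses, or an empty fallback list if parsing fails."""
    try:
        items = extract_json_array(raw)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return []
    return [c for c in items if isinstance(c, dict) and REQUIRED_FIELDS.issubset(c)]
```

A reply such as `Sure! Here is the JSON: [...]` parses cleanly because the preamble sits outside the sliced span; a preamble that itself contains a bracket would fail `json.loads` and land in the fallback path.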
What Harnesses Can't Fix#
Two failure modes the harness layer can't reach.
1. A bad prompt at the bottom still poisons the top. If the system prompt instructs the model to be a sassy pirate, no amount of tracing, retry logic, or sub-agent orchestration will get you a legal analysis. The layers compose; they don't replace. The harness can recover from a 2% conversational preamble; it can't recover from a fundamentally wrong instruction.
2. The "wrong but confident" failure. A harness can detect that the model returned malformed JSON. A harness cannot detect that the model returned beautifully formatted JSON that is also wrong on the facts. That's an evaluation problem, and evaluation is its own discipline, partially handled by the harness (the LLM-as-judge pattern in Agent Observability does this, but with its own LLM call) and partially by the application layer (humans reviewing samples, regression test suites, ground-truth comparisons).
Post 5 addresses how to decide which layer your bug is in. The short answer: read the symptom precisely, then climb. Most "the model is wrong" bugs are actually "the harness gave the model bad information," and most "the agent loops forever" bugs are missing a hook.
What's Next#
In Blog 5: Picking the Right Layer, I'll close the series with the diagnostic framework: symptom-to-layer mapping, the cost-of-iteration argument for why most teams default to the wrong layer, case studies pulled from all five referenced projects (Contract Analyzer, Clinical Research, Context Engine, Deep Research, Agent Observability), and a brief look at how the boundaries shift as reasoning models, long-context, and agent SDKs mature. The framework is short. The hard part is following it when prompt-rewriting feels like the path of least resistance.
This is post 4 of 5 in the Three Layers of AI Engineering Series. The full series covers prompt, context, and harness as three distinct engineering disciplines, with case studies drawn from five shipped projects.
Referenced projects:
- Deep Research Agent: github.com/MinhQuanBuiSco/deep-research-agent
- Agent Observability: github.com/MinhQuanBuiSco/agent-observability
- Contract Analyzer: github.com/MinhQuanBuiSco/contract-analyzer
- Context Engine: github.com/MinhQuanBuiSco/context-engine
- Clinical Research: github.com/MinhQuanBuiSco/clinical-research-agent