Three Layers of AI Engineering — Blog 5: Picking the Right Layer — A Decision Framework#
The first four posts of this series made the case for treating prompt, context, and harness as three distinct engineering disciplines. This post is the operating manual: when something goes wrong with an AI system, how do you decide — in thirty seconds, before you've sunk an afternoon — which layer your bug is on?
The answer matters because the cost of iterating on each layer is wildly different. Prompt edits cost seconds. Context changes cost hours (you're touching retrieval, possibly the data pipeline). Harness changes cost days (you're modifying the agent loop, the orchestration, the deployment). That cost gradient creates a powerful incentive to default to prompt rewrites — which is exactly how I lost two days on a max_tokens=2048 bug in Contract Analyzer.
What follows is the decision framework I now use, the five case studies that taught me it, and a brief look at how the boundaries between the layers will shift over the next few years.
The Three Layers of AI Engineering Series#
| Part | Title | Focus |
|---|---|---|
| 1 | Why I Stopped Calling It Prompt Engineering | The three-layer thesis, the running scenario, the decision question |
| 2 | Prompt Engineering — What's Still in Scope | System-as-contract, schema-first, few-shot, negative space |
| 3 | Context Engineering — Curating What the Model Sees | Complexity, compression, dedup, memory, lost-in-the-middle |
| 4 | Harness Engineering — The Loop Around the Model | Tools, sub-agents, hooks, schedulers, parsers, memory, observability |
| 5 | Picking the Right Layer: A Decision Framework (this post) | Symptom-to-layer mapping, case studies, the 2026 picture |
The Symptom-to-Layer Map#
Read the symptom precisely. Don't paraphrase. The exact words matter — "the model is wrong" and "the model is missing information" are different symptoms with different fixes.
"The model misunderstood what I wanted."#
→ Prompt layer. The instruction is ambiguous or under-specified. Tighten the schema. Make the rubric explicit. Add a negative example. Add a perspective clause.
Example from Contract Analyzer: Non-compete clauses came back "low risk" because the rubric didn't specify whose perspective to assess from. Fix was three words: "for the party receiving/signing this contract." This is the canonical prompt bug — the model did exactly what you asked, and what you asked was not quite what you meant.
"The model gave the right kind of answer about the wrong thing."#
→ Context layer. The retrieval pulled the wrong chunks. Or the right chunks got buried by noise. Or the input is too long for the model to reliably attend across.
Example from Contract Analyzer: On 60-page contracts, clauses near the top of the document were marked found: false even though they were clearly present. The model wasn't refusing to look — it couldn't reliably attend across 60k tokens of boilerplate. The fix was a context-layer truncation rule, not a prompt rewrite.
"The model gave a partial answer / no answer / a malformed answer."#
→ Harness layer. Check max_tokens. Check that the tool call actually dispatched. Check the parser. Check the stop condition.
Example from Contract Analyzer: 27 clauses returned instead of 41, with an unparseable trailing {. No error, no warning — silent runtime truncation. The fix was max_tokens=8192 (calculated, not guessed). I rewrote the prompt three times before noticing it was a harness bug.
"The model contradicts itself across turns."#
→ Harness layer (no memory) or context layer (history not included). Stateless agents will contradict themselves; agents whose conversation history isn't included in the context window will too. Both are recoverable, neither is a prompt issue.
"The model is right, but it costs too much / takes too long."#
→ Context layer. The model is doing fine; you're paying for tokens you don't need. This is the entire Context Engine thesis: complexity classification to right-size retrieval, key-sentence compression to cut search-result tokens, semantic memory to skip work, dedup to prevent reading the same finding twice.
The Context Engine cut a "What is Python?" query from 12,400 tokens to 900 tokens. 93% reduction. The prompt never changed.
"The model is right on simple inputs and falls apart on complex ones."#
→ Usually context (lost-in-the-middle) or harness (need sub-agent decomposition). Rarely prompt. Complex inputs either overload attention (context fix) or need to be broken into sub-problems and reasoned about separately (harness fix).
Example from Deep Research Agent: Single-shot research on "compare React, Vue, and Angular" produced a hedged, generic answer. Decomposing into 4 parallel sub-queries with their own researchers — a harness change, not a prompt change — produced specific answers per framework that the synthesizer could compare cleanly.
"The agent loops forever / costs too much per run."#
→ Harness layer. The agent has no deterministic stop condition. Add a hook with step ceiling and cost ceiling. Don't try to talk the model into stopping with a prompt.
"The model gave a confidently wrong answer."#
→ Not directly fixable at any of these three layers — escalate to evaluation. A harness can detect malformed output but not confidently-wrong output. The fix lives in: ground-truth evaluation suites, LLM-as-judge scoring, human spot-checks, retrieval grounding (force every claim to cite a source). The Clinical Research safety layer is an evaluation system more than a prompt or harness one.
"The model started doing this thing recently and it wasn't doing it before."#
→ Anywhere — but check the model version first. Provider updates change behavior. A prompt that worked on claude-haiku-4.5 may behave differently on claude-haiku-4.6. The fix isn't in any of the three layers; it's in pinning the model version (a harness configuration) and re-validating the prompt against the new model.
The Iteration-Cost Trap#
Here's the perverse incentive that makes the decision framework hard to follow in practice.
| Layer | Time to change | Time to validate | Cognitive load |
|---|---|---|---|
| Prompt | seconds | one call (~5s) | low |
| Context | hours | one full pipeline run (minutes) | medium |
| Harness | hours to days | full integration test (minutes to hours) | high |
When you're staring at a bug at 11pm, the prompt layer is the path of least resistance. Edit the string, re-run, see if it helped. The dopamine hit of fast iteration is real, and it pulls you toward prompt-layer fixes whether or not the bug is actually on the prompt layer.
I lost two days on the max_tokens=2048 bug in Contract Analyzer because of this. Every prompt rewrite was a 5-second iteration. Every "still wrong" feedback was immediate. It felt like I was making progress. I wasn't — the prompt was never the problem. Two days of fast iteration on the wrong layer is much worse than thirty minutes of slow iteration on the right one.
Three operational habits that help:
1. Before rewriting the prompt, write down the exact symptom in one sentence. Then run that sentence through the symptom-to-layer map above. If the layer your symptom maps to isn't "prompt," put the prompt down.
2. Time-box prompt iteration. If you've made three prompt edits and the bug hasn't moved, the bug almost certainly isn't on the prompt layer. Stop and look one layer up.
3. Treat the prompt layer as the most expensive to get right, even though it's the cheapest to change. A prompt that's 98% reliable looks like it works. A prompt that's 50% reliable looks broken. The 2% in between is where systems go to die — and where chasing the prompt becomes the dominant trap.
Five Case Studies, One Layer Each#
The bugs that taught me each layer, one per project.
Contract Analyzer — Prompt Layer#
Bug: Model classified non-compete clauses as "low risk."
Wrong fix: Adding more context about employment law to the prompt.
Right fix: Three words in the rubric: "for the party receiving/signing this contract."
Lesson: Asymmetric stakeholders need explicit perspective in the prompt. The fix is at the prompt layer because the input data and the harness were both fine — the instruction was under-specified.
Context Engine — Context Layer#
Bug: Every query — "What is Python?" included — went through the full 4-sub-query, 5-results-each research pipeline. 12,400 tokens for a question that needed 300.
Wrong fix: Trying to get the model to "answer concisely" via prompt engineering.
Right fix: A rule-based complexity classifier that routes simple queries through a 1-sub-query path. 93% token reduction.
Lesson: When the model is being asked to ignore most of its context, the bug is the context, not the prompt. Don't make the model triage your input — triage it yourself.
Clinical Research Agent — Prompt + Evaluation Layer#
Bug: Model would occasionally make uncited medical claims like "the recommended dose is 500mg twice daily" — plausible but dangerous if wrong.
Wrong fix: "Please cite your sources" in the system prompt. Reduces but doesn't eliminate.
Right fix: A post-generation safety layer that pattern-matches claim shapes (_MEDICAL_CLAIM_PATTERNS) against accepted citation formats (PMID:, NCT, [1], etc.) and flags uncited claims by severity. The prompt instructs; the harness enforces.
Lesson: For safety-critical outputs, no amount of prompt engineering reaches the reliability of post-hoc validation. The prompt layer asks; the harness verifies. Both are required.
Deep Research Agent — Harness Layer#
Bug: A single-shot research call on "compare 3 frameworks" produced a generic, hedged answer.
Wrong fix: A more elaborate prompt with explicit comparison structure.
Right fix: Sub-agent decomposition. Orchestrator splits the query into per-framework sub-queries, dispatches 4 parallel researchers (asyncio.gather), critique agent reviews, synthesizer consolidates. 6 agent calls, ~20 seconds total. The output is per-framework specific because each researcher was scoped to one framework.
Lesson: Comparison and decomposition are harness problems, not prompt problems. The harness shape — what work runs in parallel, what runs sequentially, what passes data to what — is what changes the output quality.
Agent Observability — The Layer-That-Lets-You-Find-Bugs#
Bug: Total p95 latency was 18 seconds — too slow — but it wasn't clear why. The aggregate looked uniform across runs.
Wrong fix: Trying to speed up the synthesis prompt.
Right fix: Hierarchical OpenTelemetry tracing with gen_ai.* span attributes. The first trace showed one of the four researchers consistently taking 4.5 seconds vs ~3 seconds for the others — using 4x the expected tokens for a sub-query that should have been simple. The bug was a poorly-scoped sub-query (a context-layer bug); the harness's job was just to make the bug visible.
Lesson: Tracing isn't a layer that fixes bugs. It's a layer that reveals which layer the bug is on. Without it, you're guessing. With it, the symptom-to-layer mapping above becomes trivial — you just read the trace.
The 2026 Picture: Where the Boundaries Shift#
Three trends are actively redrawing the lines between the layers. Worth tracking.
1. Reasoning models reduce the prompt layer's surface area. Models that do extended thinking before responding (Claude's reasoning mode, GPT's o-series) take some of the work the prompt used to do. Chain-of-thought used to be a prompt technique; in reasoning models, it's the default behavior. Schema enforcement used to be a prompt technique; constrained decoding makes it a harness configuration. The prompt is shrinking, the harness is growing.
2. Long-context models do not eliminate context engineering. A 2-million-token window is not an excuse to dump 2 million tokens in. The lost-in-the-middle tax doesn't go away with more context — it gets worse. Long-context models increase the value of context engineering because the cost of getting curation wrong gets bigger.
3. Agent SDKs standardize the harness layer. Strands, LangGraph, the Claude Agent SDK, Anthropic's Managed Agents — all of them are doing the harness work (tool dispatch, sub-agent orchestration, retries, observability) as a framework concern rather than as code you write yourself. This is mostly good. It also means harness bugs increasingly show up as framework-version mismatches and dependency upgrades, which require a different debugging muscle than custom orchestration code.
The net effect: the prompt layer will keep getting smaller, the context layer will keep getting more important, and the harness layer will keep getting more standardized but more critical when it breaks. The three-layer split survives all three trends — the amount of work on each layer shifts, but the discipline boundaries don't move.
Final Diagnostic — A Closing Checklist#
Before you reach for any layer, ask yourself these in order:
- What is the exact symptom, in one sentence? (Not paraphrased. Verbatim.)
- Which layer does that symptom map to? (Use the table above.)
- Have I changed the prompt three times without progress? (If yes, stop. The bug isn't on the prompt layer.)
- Is there a
max_tokens, temperature, timeout, or retry config I haven't checked? (If you can't recite the current values, that's where to look first.) - Is the input the model actually sees the input I think it sees? (Log the literal context window. Half the time, you'll find the answer there.)
- Is there a trace of this run? (If not, fix that before fixing anything else.)
The framework is short on purpose. It has to be runnable in thirty seconds, at 11pm, when the prompt-rewrite tab is calling your name.
Closing: Three Layers, One System#
The case I've made across these five posts can be compressed into a single claim: AI engineering is three disciplines, not one, and treating them as one is what makes most AI bugs intractable.
The prompt layer is necessary and bounded. It is not the whole of AI engineering, and pretending it is sets you up for two-day debugging sessions on bugs a one-line config change would have fixed.
The context layer is where you fight the lost-in-the-middle tax, where token economics gets decided, and where the model goes from "doing fine on toy inputs" to "doing fine on production inputs."
The harness layer is where a model becomes an agent. It's also where most production AI systems live or die — because reliability, observability, and cost control all live there.
Each layer has its own techniques, its own failure modes, its own iteration costs. Each layer has clear boundaries with the others. And once you can see the boundaries, the work of debugging — and the work of building — becomes a question of which layer, not "what's wrong with my prompt."
The Contract Analyzer running scenario from this series ended up using all three: a 200-line extraction prompt with closed-vocabulary classification, a context pipeline with PDF parsing and 80k-character truncation, a harness with sub-agent decomposition and defensive parsing and OpenTelemetry tracing. None of those pieces could have done the others' jobs. None of them, alone, would have been enough.
The discipline isn't picking the right one. The discipline is knowing which one you're working on.
This is post 5 of 5 in the Three Layers of AI Engineering Series. The full series covers prompt, context, and harness as three distinct engineering disciplines, with case studies drawn from five shipped projects.
Referenced projects:
- Contract Analyzer: github.com/MinhQuanBuiSco/contract-analyzer
- Clinical Research: github.com/MinhQuanBuiSco/clinical-research-agent
- Context Engine: github.com/MinhQuanBuiSco/context-engine
- Deep Research Agent: github.com/MinhQuanBuiSco/deep-research-agent
- Agent Observability: github.com/MinhQuanBuiSco/agent-observability