Three Layers of AI Engineering - Blog 1: Why I Stopped Calling It Prompt Engineering
I had a contract on my desk and a deadline. The Contract Analyzer prompt was at 98% reliability — and the last 2% was eating me alive.
The model would occasionally rephrase "Termination for Convenience" as "Convenience Termination Clause." It would assess a non-compete as "low risk" — true, if you were the employer. On the fifth test run of an afternoon, it returned 27 clauses instead of 41, with an unparseable trailing { where the rest of the JSON should have been. I rewrote the prompt. I rewrote it again. I added IMPORTANT-caps reinforcement, schema constraints, perspective clauses. The cliff held — but it was always a 2% cliff, not zero.
The fixes that actually got the system to production weren't prompt rewrites. They were a max_tokens setting (the 27-clauses-instead-of-41 bug), an 80,000-character truncation rule (a different bug where clauses near the top of the document were inexplicably flagged "not found"), and a five-line defensive JSON parser that lived completely outside the prompt (the trailing-text bug). None of those are prompt engineering. They're the layers above and below the prompt — and pretending they were prompt problems is what kept me stuck.
That experience is the seed for this series.
The Three Layers of AI Engineering Series
| Part | Title | Focus |
|---|---|---|
| 1 | Why I Stopped Calling It Prompt Engineering (this post) | The three-layer thesis, the running scenario, the decision question |
| 2 | Prompt Engineering — What's Still in Scope | System-as-contract, schema-first, few-shot, negative space |
| 3 | Context Engineering — Curating What the Model Sees | Complexity, compression, dedup, memory, lost-in-the-middle |
| 4 | Harness Engineering — The Loop Around the Model | Tools, sub-agents, hooks, schedulers, parsers, memory, observability |
| 5 | Picking the Right Layer: A Decision Framework | Diagnostic flowchart, iteration cost per layer, case studies across five projects |
Each post revisits the same Contract Analyzer scenario from a different layer. By post 4, the reader sees the same input solved three ways — same prompt, different context, different harness — with the token counts and outcomes to prove the boundaries are real.
What "Prompt Engineering" Got Me, and Where It Stopped
I want to be precise about what worked, because the case I'm about to make isn't against prompt engineering. It's against the marketing definition of it — the one where "prompt engineering" means "everything you do to make an LLM behave."
Here's what the prompt layer actually fixed in Contract Analyzer.
The closed-vocabulary fix. Early prompts asked the model to classify clauses against the 41-type CUAD taxonomy. Without an explicit constraint, the model would rephrase types in defensible but unjoinable ways: "Termination for Convenience" became "Convenience Termination Clause" on one run, "Convenience-Based Termination" on another. Each variant is correct English. Each one also breaks the downstream join against the taxonomy and turns the whole pipeline into garbage. Adding the parenthetical "clause_type": string (exact name from the list above) to the schema, plus temperature 0.1, dropped the violation rate from roughly 1-in-5 clauses to 0-in-35 across the test suite. (Full post.)
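A minimal sketch of how a closed-vocabulary violation rate like the one above could be measured. This is not the Contract Analyzer code: `CLAUSE_TYPES` is a three-item illustrative excerpt rather than the full 41-type list, and `vocabulary_violations` and `violation_rate` are hypothetical helper names.

```python
# Illustrative excerpt only; the real taxonomy has 41 clause types.
CLAUSE_TYPES = frozenset({
    "Termination for Convenience",
    "Non-Compete",
    "Governing Law",
})


def vocabulary_violations(clauses: list[dict]) -> list[str]:
    """Return every clause_type value that is not an exact taxonomy name."""
    return [
        c.get("clause_type", "")
        for c in clauses
        if c.get("clause_type") not in CLAUSE_TYPES
    ]


def violation_rate(clauses: list[dict]) -> float:
    """Fraction of extracted clauses whose type would break the downstream join."""
    if not clauses:
        return 0.0
    return len(vocabulary_violations(clauses)) / len(clauses)


sample = [
    {"clause_type": "Termination for Convenience"},
    {"clause_type": "Convenience Termination Clause"},  # fine English, unjoinable key
]
print(vocabulary_violations(sample))  # ['Convenience Termination Clause']
print(violation_rate(sample))         # 0.5
```

Exact string membership is deliberate here: the join downstream is exact, so a fuzzy match would hide the very failures the constraint is meant to remove.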
The perspective rubric. Earlier versions just said "assess the risk level." The model would sometimes assess from the wrong side of the deal: a non-compete came back "low risk" because it is low risk to the employer. The employee about to sign would disagree. The fix was one short phrase: "for the party receiving/signing this contract." Asymmetric stakeholders need an explicit perspective in the prompt; otherwise the model averages across viewpoints, and averaged risk is useless risk.
Belt-and-suspenders JSON. System prompt: Always respond with valid JSON only. User prompt ending: IMPORTANT: Return ONLY the JSON array, no other text. Two independent reinforcements of the same constraint at different points in the conversation. System prompt alone got about 92% JSON-only responses. User prompt alone got 95%. Both together: 98%.
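As a rough sketch, the double reinforcement amounts to stating the constraint once in the system prompt and once more as the last line of the user prompt. The two constraint strings are the ones quoted above; `build_messages`, the dictionary shape, and the way the system sentence is combined with the persona line are assumptions for illustration.

```python
SYSTEM_PROMPT = (
    "You are a precise legal analysis AI. "
    "Always respond with valid JSON only."
)
JSON_REMINDER = "IMPORTANT: Return ONLY the JSON array, no other text."


def build_messages(task_prompt: str) -> dict[str, str]:
    """State the JSON-only constraint twice: system prompt and end of user prompt."""
    return {
        "system": SYSTEM_PROMPT,
        # The reminder goes last so it is the final thing the model reads.
        "user": f"{task_prompt.rstrip()}\n\n{JSON_REMINDER}",
    }


messages = build_messages("Extract the clauses from the contract below.\n")
print(messages["user"])
```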
These are real wins. They're also the entire surface area of what "prompt engineering" can do. Past 98%, the curve flattens, and the next failures aren't on the prompt's axis at all.
The Bugs That No Prompt Rewrite Could Fix
Three failures forced me to stop reaching for the prompt. They sit on different layers, and each one was a quiet failure — the kind that doesn't throw an exception, doesn't fire a log warning, and looks for all the world like a model problem until you look closer.
Bug 1: 27 clauses instead of 41
My first version of the extraction call used max_tokens=2048 by muscle memory. The fifth test run came back with 27 clauses and a JSON array that ended mid-object with ... "explanation": "This claus. No error. No warning. The model had been cut off mid-output, and the response was unparseable not because the model had hallucinated but because the runtime had silently truncated it.
There is no prompt that fixes this. You can write IMPORTANT: complete all 41 clauses in every capital letter you know, and the runtime will still cut you off at token 2048. The fix is a one-line change to the model configuration: max_tokens=8192. That's a harness concern — the runtime parameters that govern how the model is invoked, not what the model is told.
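The general rule is this: estimate output budget as (tokens-per-item × cardinality × 1.5) before you write the call. For 41 clauses at ~150 tokens each, that is 41 × 150 × 1.5 ≈ 9,200 tokens of headroom. The expected output alone is about 6,150 tokens, three times what 2048 allows. Note that max_tokens=8192 clears the expected output with roughly a 1.3× margin, a little short of the full 1.5× the rule asks for. That calculation has nothing to do with prompt craft.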
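The budget rule is simple enough to write down as a function. The sketch below is illustrative: `estimate_max_tokens` and `hit_output_limit` are hypothetical names. The second helper relies on the Anthropic Messages API reporting a `stop_reason` of `"max_tokens"` when output is cut off; how a given SDK or wrapper surfaces that value is an assumption you would need to check.

```python
import math


def estimate_max_tokens(tokens_per_item: int, cardinality: int,
                        safety_factor: float = 1.5) -> int:
    """Output budget = tokens-per-item x cardinality x safety factor, rounded up."""
    return math.ceil(tokens_per_item * cardinality * safety_factor)


def hit_output_limit(stop_reason: str) -> bool:
    """True when the response ended because the output token limit was reached."""
    return stop_reason == "max_tokens"


budget = estimate_max_tokens(tokens_per_item=150, cardinality=41)
print(budget)          # 9225
print(budget > 2048)   # True
```

Checking the stop reason in the harness turns a quiet truncation into a loud one: the call can raise or retry instead of handing half an array to the parser.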
Bug 2: Clauses near the top of long contracts marked "not found"
On longer contracts (50+ pages), clauses that were clearly present near the beginning of the document would occasionally come back found: false. The model wasn't hitting its context limit — Claude Haiku has a 200k-token window and the contracts were well under that. It could see the text. It just wasn't finding the answer there.
This is the "lost in the middle" effect: model attention degrades non-uniformly across long contexts, and the degradation kicks in well before the physical limit. The fix in Contract Analyzer was a truncation rule — cut input at 80,000 characters (~20k tokens), because past page 30 of a commercial contract you're in definitions, schedules, and boilerplate anyway. The clauses that actually matter sit in the first half.
Again: no prompt fixes this. The model isn't refusing to look; the model can't reliably attend to what you've buried under 60k tokens of boilerplate. The fix is to curate the context window before the model sees it — which is the entire discipline of context engineering. (Full version in the Context Engine series.)
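A sketch of the truncation rule, assuming a plain front-of-document cut. `truncate_contract` and the returned flag are hypothetical; the flag exists so the harness can log when a contract was shortened rather than dropping the tail silently.

```python
MAX_CONTRACT_CHARS = 80_000  # roughly 20k tokens at ~4 characters per token


def truncate_contract(text: str,
                      max_chars: int = MAX_CONTRACT_CHARS) -> tuple[str, bool]:
    """Keep the front of the contract; return (text, was_truncated)."""
    if len(text) <= max_chars:
        return text, False
    return text[:max_chars], True


short_text, short_cut = truncate_contract("Section 1. Definitions.")
long_text, long_cut = truncate_contract("x" * 100_000)
print(short_cut, len(short_text))  # False 23
print(long_cut, len(long_text))    # True 80000
```

The cut is by characters, not tokens, which keeps it independent of any tokenizer. The price is that the token estimate is only approximate.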
Bug 3: "Sure! Here is the analysis:"
About 2% of runs came back with conversational preamble around the JSON: "Sure! Here is the analysis:" followed by the array, followed by "Let me know if you need anything else." json.loads() raises on the preamble. I tried four different prompt rewrites to suppress it. The success rate ticked up by about 0.3 percentage points each time, then plateaued.
The actual fix was five lines of Python outside the model call:
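```python
# Slice from the first "[" to the last "]" to drop any conversational wrapper.
start = response_text.find("[")
end = response_text.rfind("]") + 1  # rfind returns -1 when "]" is missing, so end is 0
if start == -1 or end <= start:  # no "[", no "]", or the last "]" precedes the first "["
    raise ValueError("No JSON array found in response")
json_str = response_text[start:end]
```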
That parser is not part of the prompt. It's part of the harness: the deterministic code that wraps the model and turns its probabilistic output into something the rest of the system can trust. The lesson generalizes: when you're chasing the last percent of reliability with prompt edits, you're spending hours on a problem that a 5-line wrapper would solve in minutes.
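For a version that can be run end to end, here is the same slicing idea wrapped in a function and followed by the actual parse. `extract_json_array` is a hypothetical name, and the sample responses are made up to mirror the two failure shapes described above (preamble around the array, and output cut off mid-object).

```python
import json


def extract_json_array(response_text: str) -> list:
    """Slice out the outermost [...] span and parse it as JSON."""
    start = response_text.find("[")
    end = response_text.rfind("]") + 1
    if start == -1 or end <= start:
        raise ValueError("No JSON array found in response")
    # json.JSONDecodeError subclasses ValueError, so one except clause covers both.
    return json.loads(response_text[start:end])


chatty = (
    "Sure! Here is the analysis:\n"
    '[{"clause_type": "Governing Law", "found": true}]\n'
    "Let me know if you need anything else."
)
print(extract_json_array(chatty))

cut_off = '[{"explanation": "This claus'
try:
    extract_json_array(cut_off)
except ValueError as exc:
    print(f"rejected: {exc}")
```

One limitation worth knowing: a stray bracket in the preamble or sign-off (for example "[note]") widens the slice and makes the parse fail. That is still a loud failure, which is the point, but it is not a recovery.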
Three Layers, Not One
What the three bugs share is this: each one looks like a "the AI did something wrong" failure, and each one actually lives in a layer above or below the prompt. Once you see them as separate layers, the fix becomes obvious. Until you do, you spend a week rewriting prompts.
Here's the model I now use. The "Example artifacts" column is a 2026 snapshot; read it as an inventory, not a closed list:
| Layer | What you control | Example artifacts | When it's the bottleneck |
|---|---|---|---|
| Prompt engineering | The instruction string itself | System prompt, user prompt, JSON schema, few-shot examples, persona, output format, rubric, negative space | Single-turn tasks where the input is fixed and the model behavior needs tuning |
| Context engineering | What enters the context window before the LLM thinks | Retrieved chunks (RAG), hybrid retrieval + reranking, conversation history, tool definitions (often via MCP), agent memory (episodic/semantic/procedural), prompt caching breakpoints, context compaction/summarization, window ordering, truncation rules, compression, deduplication, multimodal inputs | Multi-source inputs, RAG, agents with state, anything where token budget or cache-hit rate matters |
| Harness engineering | The runtime loop around the model | The agent loop itself; decoder params (max_tokens, temperature, thinking budget); Agent SDKs (Claude Agent SDK, OpenAI Agents SDK, LangGraph, Mastra); MCP servers; tool dispatch; sub-agent topologies (orchestrator-worker, swarm, hierarchical); hooks (PreToolUse, PostToolUse, SubagentStop); permissions and sandboxing; output validators (Pydantic, Instructor, constrained decoding); retries; durable execution (Temporal, Lambda Durable Functions); persistent memory backends (Mem0, Letta, Zep); streaming events; cost tracking; OpenTelemetry GenAI tracing; LLM-as-judge evaluation | Multi-turn agents, autonomous workflows, long-horizon tasks, anything stateful, anything that needs reliability guarantees |
Three properties of this stack matter.
The layers compose; they don't replace one another. A perfect harness can't rescue a confused prompt. A surgical prompt can't rescue a bloated context window. You pay the cost of all three. The good news is the layers are mostly independent: you can iterate on each one without breaking the others.
Each layer has a different iteration cost. Prompt edits cost seconds. Context changes cost hours (you're touching retrieval, compression, often the data pipeline). Harness changes cost days (you're modifying the agent loop, the orchestration, the deployment). That cost gradient creates an incentive to default to the prompt — and that incentive is exactly what kept me rewriting prompts for two days on a problem max_tokens=8192 would have solved in two minutes.
Each layer has a different debug signature. A prompt bug shows up as "the model misunderstood." A context bug shows up as "the model didn't have the information" or "the model lost the information in the middle." A harness bug shows up as "the model was cut off / never called / called too many times / called with the wrong tool." Post 5 of this series is a full diagnostic flowchart, but the headline is: read the symptom carefully before you reach for any layer.
What Lives Where: One Annotated Agent Call
The clearest way to ground the three layers is to look at a single LLM call and label every component by layer.
Here's the Contract Analyzer extraction call, with each piece tagged. The diagram shows what's actually in the production code today; the parenthetical notes flag pieces a 2026 production agent would typically add on top:
┌─ HARNESS LAYER ─────────────────────────────────────────────────┐
│ Invocation │
│ • BedrockModel(model_id, region, max_tokens=8192, temp=0.1) │
│ • Agent(model, system_prompt=...) — Strands SDK │
│ • agent(prompt) → result │
│ • response_text = str(result) │
│ • (2026 add-on: cache_control breakpoints on stable prefix, │
│ extended-thinking budget, batch API for non-realtime work) │
│ │
│ Output handling │
│ • Defensive JSON parser (find "[" / rfind "]") │
│ • Pydantic schema validation │
│ • Fallback _fallback_clauses() on exception │
│ • (2026 add-on: Instructor-style auto-retry, constrained │
│ decoding via tool_choice for guaranteed-valid JSON) │
│ │
│ Orchestration & control flow │
│ • Single LLM call; no sub-agents in v1 │
│ • (2026 add-on for cross-clause reasoning: sub-agent fan-out │
│ per dep pair, Critique agent as LLM-as-judge, hooks for │
│ step ceiling + cost ceiling, permissions / tool allow-list) │
│ │
│ Observability │
│ • Python logging │
│ • (2026 add-on: OpenTelemetry GenAI spans — gen_ai.usage.*, │
│ gen_ai.request.model, Langfuse / Phoenix / Braintrust │
│ export, LLM-as-judge eval suite in CI) │
│ │
│ Long-running concerns │
│ • SSE streaming with 10s keepalives (covered in scoring post) │
│ • (2026 add-on: durable execution via Temporal / Lambda │
│ Durable Functions for multi-hour pipelines, saga pattern │
│ for compensating actions) │
│ │
│ ┌─ CONTEXT LAYER ───────────────────────────────────────────┐ │
│ │ Input assembly │ │
│ │ • contract_text from pdfplumber │ │
│ │ • Truncation at 80,000 chars (front-loading assumption) │ │
│ │ • clause_type_list formatted from CUAD taxonomy │ │
│ │ • EXTRACTION_PROMPT.format(clause_types, contract_text) │ │
│ │ │ │
│ │ Window ordering │ │
│ │ • System prompt → instructions → schema → taxonomy │ │
│ │ → contract text → "IMPORTANT" reinforcement │ │
│ │ • Stable prefix first (cacheable) → variable input last │ │
│ │ • (2026 add-on: explicit cache_control breakpoint after │ │
│ │ taxonomy = 90% input-token cost reduction on reads) │ │
│ │ │ │
│ │ Retrieval & memory (not used in v1) │ │
│ │ • (2026 add-on for similar prior contracts: hybrid │ │
│ │ BM25+embedding retrieval, Cohere/Voyage reranker, │ │
│ │ semantic memory cache for re-analyzed contracts) │ │
│ │ │ │
│ │ Tool / agent context │ │
│ │ • No tools in this call (single-shot extraction) │ │
│ │ • (2026 add-on for tool-augmented version: MCP server │ │
│ │ for legal-tools-as-protocol, just-in-time tool │ │
│ │ loading instead of flat 30-tool registry) │ │
│ │ │ │
│ │ ┌─ PROMPT LAYER ────────────────────────────────────┐ │ │
│ │ │ • System: "You are a precise legal analysis AI." │ │ │
│ │ │ • Persona: senior legal analyst (contract │ │ │
│ │ │ framing — see Post 2) │ │ │
│ │ │ • Rubric: 4 risk levels with definitions │ │ │
│ │ │ • Perspective clause (signing party) │ │ │
│ │ │ • Closed-vocabulary constraint (exact names) │ │ │
│ │ │ • Schema: 6 fields, types + enums specified │ │ │
│ │ │ • Few-shot examples: zero (schema is enough) │ │ │
│ │ │ • Negative space: null vs "N/A", no preamble │ │ │
│ │ │ • "IMPORTANT: Return ONLY the JSON array..." │ │ │
│ │ └───────────────────────────────────────────────────┘ │ │
│ └───────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
Read this top-down. The harness layer owns the runtime: how the model is invoked, what parameters govern it, what tools it can reach, what happens to the output afterward, how the loop terminates, and what gets observed. The context layer owns the assembly: how the input gets retrieved, curated, ordered, truncated, formatted, and made cacheable before becoming a prompt. The prompt layer owns the instruction itself.
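The same nesting can be sketched as three small pieces of Python, one per layer. This is a simplified illustration, not the production code: the prompt text is abbreviated, `call_model` is a hypothetical stand-in for whatever SDK call the harness makes, and `fake_model` exists only so the example runs without a network call.

```python
import json
from typing import Callable

# Prompt layer: the instruction string itself (abbreviated here).
EXTRACTION_PROMPT = (
    "Classify each clause using an exact name from this list:\n{clause_types}\n\n"
    "Contract:\n{contract_text}\n\n"
    "IMPORTANT: Return ONLY the JSON array, no other text."
)


# Context layer: decide what enters the window, and in what order.
def assemble_context(contract_text: str, clause_types: list[str],
                     max_chars: int = 80_000) -> str:
    return EXTRACTION_PROMPT.format(
        clause_types="\n".join(f"- {name}" for name in clause_types),
        contract_text=contract_text[:max_chars],  # truncation rule
    )


# Harness layer: invocation parameters and output handling.
def run_extraction(call_model: Callable[[str, int], str], prompt: str,
                   max_tokens: int = 8192) -> list:
    response_text = call_model(prompt, max_tokens)
    start = response_text.find("[")
    end = response_text.rfind("]") + 1
    if start == -1 or end <= start:
        return []  # stand-in for a real fallback path
    return json.loads(response_text[start:end])


def fake_model(prompt: str, max_tokens: int) -> str:
    """Hypothetical model stub that misbehaves the way Bug 3 describes."""
    return 'Sure! Here is the analysis:\n[{"clause_type": "Governing Law"}]'


prompt = assemble_context("This Agreement is governed by ...", ["Governing Law"])
clauses = run_extraction(fake_model, prompt)
print(clauses)
```

Each function can change without touching the other two, which is the practical meaning of the layers being mostly independent.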
The parenthetical "2026 add-on" notes matter. The diagram you'd draw for a production agent in mid-2026 has more in every layer than the diagram you'd have drawn in 2024 — prompt caching, MCP, hooks, durable execution, structured-output stacks, and LLM-as-judge eval are all new harness-or-context primitives that didn't exist or weren't standardized two years ago. Posts 3 and 4 walk through them with worked examples and primary-source citations.
The three bugs from the previous section map cleanly:
- max_tokens=2048 → harness layer
- 80,000-character truncation → context layer
- Conversational preamble → fixed at the harness layer (the parser)
The closed-vocabulary fix, the perspective rubric, the schema, the IMPORTANT-caps — those are all genuinely prompt layer. They worked, they're keepers, they got the system 98% of the way home. They just couldn't go further on their own.
The Running Scenario for This Series
Here's the scenario the next three posts revisit:
A lawyer uploads a 60-page MSA. The system needs to extract every clause against the 41-type CUAD taxonomy, score the contract from 0-100 for risk, and produce a one-paragraph executive summary that flags the top three concerns. Output must be machine-readable JSON. Cost: under one cent per contract. Latency: under 30 seconds.
That's the brief. In Post 2 (prompt engineering), I'll show how the prompt layer alone gets you most of the way to extraction — and exactly where it stops. In Post 3 (context engineering), I'll show how the same prompt, with a curated context window, fixes the long-document degradation problem and the cost ceiling. In Post 4 (harness engineering), I'll show how a sub-agent loop, a hook-enforced stop condition, and persistent memory get you the multi-clause reasoning (relating indemnity to liability cap, for example) that a single-shot extraction can't.
Each of those posts ends with a "Same query, three rewrites" box that shows the literal prompt, the literal context window, and the literal harness configuration side by side. By the end of post 4, you've watched the same input get solved three different ways, with the cost and outcome numbers to show what each layer added.
How to Know Which Layer Your Bug Is In (preview)
Post 5 builds the full diagnostic framework. Here's the abridged version, because it's useful even before the series is finished.
Symptom: the model misunderstood what I wanted. → Prompt layer. Tighten the schema, add a negative example, make the rubric explicit.
Symptom: the model gave the right kind of answer about the wrong thing. → Context layer. Check what was actually in the context window. The retrieval probably pulled the wrong chunks, or the right chunks got buried.
Symptom: the model gave a partial answer / no answer / a malformed answer. → Harness layer. Check max_tokens, check that the tool call dispatched, check the parser, check the stop condition.
Symptom: the model contradicts itself across turns. → Harness layer (you don't have memory) or context layer (the conversation history isn't being included).
Symptom: the model is right, but it costs too much / takes too long. → Context layer. The model is doing fine; you're just paying for tokens you don't need. (This is the Context Engine thesis in one line.)
Symptom: the model is right on simple inputs and falls apart on complex ones. → Usually context layer (you're hitting the lost-in-the-middle threshold) or harness layer (you need a sub-agent decomposition). Rarely prompt.
Notice what's missing from that list: "rewrite the prompt 14 times." If the prompt got you to 98%, the prompt isn't your bottleneck. The remaining 2% almost certainly lives one or two layers up.
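The symptom list above can be sketched as a lookup table, which is handy to keep next to an incident log. The symptom keys and the `layers_to_check` helper are hypothetical names; the mapping itself just restates the list, with the first layer in each entry being the one to check first.

```python
from enum import Enum


class Layer(Enum):
    PROMPT = "prompt"
    CONTEXT = "context"
    HARNESS = "harness"


# Symptom -> layers to inspect, in the order the list above suggests.
TRIAGE: dict[str, list[Layer]] = {
    "misunderstood_request": [Layer.PROMPT],
    "right_answer_wrong_subject": [Layer.CONTEXT],
    "partial_missing_or_malformed_output": [Layer.HARNESS],
    "contradicts_itself_across_turns": [Layer.HARNESS, Layer.CONTEXT],
    "correct_but_slow_or_expensive": [Layer.CONTEXT],
    "fails_on_complex_inputs": [Layer.CONTEXT, Layer.HARNESS],
}


def layers_to_check(symptom: str) -> list[Layer]:
    """Return the layers to inspect for a symptom; fail loudly on unknown ones."""
    try:
        return TRIAGE[symptom]
    except KeyError:
        raise ValueError(f"Unknown symptom: {symptom!r}") from None


print([layer.value for layer in layers_to_check("fails_on_complex_inputs")])
# ['context', 'harness']
```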
What's Next
In Blog 2: Prompt Engineering — What's Still in Scope, I'll commit to the layer this series is named after. The prompt isn't dead; it's just smaller than it used to be. I'll show the four techniques that still matter — system-as-contract, schema-first output, calibrated few-shot, and negative space — using the actual Contract Analyzer extraction prompt as the worked example. Then I'll mark the exact place where the prompt layer stops being able to help, which is where Post 3 picks up.
This is post 1 of 5 in the Three Layers of AI Engineering Series. The full series covers prompt, context, and harness as three distinct engineering disciplines, with case studies drawn from five shipped projects.
Referenced projects:
- Contract Analyzer: github.com/MinhQuanBuiSco/contract-analyzer
- Context Engine: github.com/MinhQuanBuiSco/context-engine
- Deep Research Agent: github.com/MinhQuanBuiSco/deep-research-agent
- Clinical Research: github.com/MinhQuanBuiSco/clinical-research-agent
- Agent Observability: github.com/MinhQuanBuiSco/agent-observability