Three Layers of AI Engineering — Blog 2: Prompt Engineering — What's Still in Scope#
In Post 1, I argued that "prompt engineering" had been bloated into a meaningless umbrella term — and that three of the worst Contract Analyzer bugs were one or two layers above the prompt, not on it. This post does the opposite work: it commits to the prompt layer and shows exactly what it's good for.
The four techniques in this post got the Contract Analyzer extraction call from a coin-flip to 98% reliability. They are not glamorous. There is no "10x your prompts with this one weird trick" pattern here. The techniques are: treat the system prompt as a contract, specify the output schema before you write the instruction, calibrate few-shot examples instead of adding them by reflex, and use negative space to mark what you don't want as carefully as what you do.
I'll show all four using the real extraction and scoring prompts. Then I'll mark the exact line where the prompt layer hits its ceiling and the context layer (Post 3) takes over.
The Three Layers of AI Engineering Series#
| Part | Title | Focus |
|---|---|---|
| 1 | Why I Stopped Calling It Prompt Engineering | The three-layer thesis, the running scenario, the decision question |
| 2 | Prompt Engineering — What's Still in Scope (this post) | System-as-contract, schema-first, few-shot, negative space |
| 3 | Context Engineering — Curating What the Model Sees | Complexity, compression, dedup, memory, lost-in-the-middle |
| 4 | Harness Engineering — The Loop Around the Model | Tools, sub-agents, hooks, schedulers, parsers, memory, observability |
| 5 | Picking the Right Layer: A Decision Framework | Diagnostic flowchart, iteration cost per layer, case studies across five projects |
Why I'd Rather Call It "Instruction Engineering"#
The phrase "prompt engineering" has rotted from overuse. It now covers retrieval, agent orchestration, fine-tuning data prep, and roughly anything involving an LLM. That overloading is the problem this series exists to solve.
What I mean by "prompt engineering" in this post — and what I'll defend as a real discipline — is much narrower:
The craft of writing the instruction string the model sees, in a way that constrains its behavior to produce the exact output your system needs.
That's it. Not retrieval. Not orchestration. Just: the string. Specifically, the system prompt and the user prompt for a single LLM call. The full set of artifacts in scope for this post:
- System prompt (the persona, the contract)
- User prompt (the instruction, the schema, the rubric, the examples)
- Output format specification
- The handful of decoder parameters whose effect is felt at the instruction level (temperature being the obvious one)
Everything else — max_tokens, retries, tool definitions, retrieval, sub-agents, memory — is the next two layers up. We'll get there.
Technique 1: System Prompt as Contract, Not Persona#
The first thing most tutorials teach you about system prompts is the persona pattern: "You are a helpful AI assistant." "You are an expert Python developer." "You are a sassy pirate." That's not wrong, exactly — but it sells the system prompt short.
The Contract Analyzer extraction call uses a one-line system prompt:
system_prompt="You are a precise legal analysis AI. Always respond with valid JSON only."
Compare that to the user prompt, which is 200+ lines including the full 41-clause taxonomy and the schema. The system prompt is doing almost no descriptive work. What it's doing is establishing a contract: from this point forward in the conversation, two things are non-negotiable — precision, and JSON-only output.
The contract framing changes how you write system prompts. Persona thinking asks "what kind of voice should the model adopt?" Contract thinking asks "what are the inviolable rules of this interaction?" The persona is easy to write and doesn't do much. The contract is harder to write and load-bears the whole call.
Three rules I now follow for system prompts:
1. The system prompt enforces what cannot be negotiated; the user prompt requests what can. "Always respond with valid JSON only" is a contract clause — there's no version of the conversation where conversational prose is acceptable. "Analyze these 41 clause types" is a request — it's specific to this call.
2. Repeating a constraint in both the system prompt and the user prompt is a feature, not a redundancy. From the Contract Analyzer extraction post: "System prompt alone got about 92% JSON-only responses; user prompt alone got about 95%; both together reached 98%." Two independent reinforcements of the same constraint at different points in the conversation produce a measurable bump. Belt-and-suspenders works.
3. The persona, when present, should justify the contract. "You are a precise legal analysis AI" earns the right to demand precision. "You are a sassy pirate" earns the right to demand sea shanties. The persona is the why behind the contract clauses.
The contract analyzer scoring pipeline shows the same pattern with a different shape. The second LLM call — the one that generates the executive summary — uses a completely different system prompt:
system_prompt="You are a senior legal analyst writing executive summaries..."
Different role, different contract. The extraction call demands rigid JSON; the summary call demands prose that flows. Same model, same SDK call — the system prompt is what makes them feel like two different services. Full context in the scoring post.
Technique 2: Schema First, Instruction Second#
The default way to write a prompt is to describe the task and then mention what the output should look like. ("Analyze this contract and return a list of clauses with..."). I now do it the other way around.
The schema goes first in my head, and the instruction wraps around it. The extraction prompt is literally laid out this way:
For each clause type listed below, determine:
1. Whether it is present in the contract (found: true/false)
2. If found, extract the most relevant text excerpt (1-3 sentences)
3. Assess the risk level for the party receiving/signing this contract:
- "low": Standard, balanced clause that protects both parties
- "medium": Slightly one-sided but common in industry
- "high": Significantly favors the other party, unusual terms
- "critical": Extremely one-sided, missing protections, or potentially harmful
4. Brief explanation of why this risk level was assigned
CLAUSE TYPES TO CHECK:
{clause_types}
CONTRACT TEXT:
{contract_text}
Respond with a JSON array of objects. Each object must have these exact fields:
- "clause_type": string (exact name from the list above)
- "category": string (the category group)
- "found": boolean
- "text_excerpt": string or null (if not found)
- "risk_level": "low" | "medium" | "high" | "critical"
- "explanation": string
IMPORTANT: Return ONLY the JSON array, no other text. Analyze ALL clause types listed."""
Notice what's happening: the schema is specified twice — once as the conceptual rubric (numbered list at the top) and once as the literal field-by-field type contract (bullet list at the bottom). The bottom one is where the work gets done.
Three details in that schema specification did real load-bearing work in production:
"clause_type": string (exact name from the list above). Without that parenthetical, the model would rephrase clause names in defensible but unjoinable ways: "Termination for Convenience" became "Convenience Termination Clause." The "(exact name from the list above)" constraint, combined with temperature 0.1, dropped the violation rate from 1-in-5 to 0-in-35 across the test suite. This is what I mean by closed-vocabulary classification — the model is choosing from a fixed list, not inventing names.
"text_excerpt": string or null (if not found). Without that, the model would return "", "N/A", "Not found", or sometimes "None" depending on which English idiom felt right. The frontend does text_excerpt && <div>{text_excerpt}</div> — "N/A" is truthy and renders a div containing "N/A". null is the only empty state that behaves correctly downstream, and the prompt has to say null. The model will not guess.
"risk_level": "low" | "medium" | "high" | "critical". TypeScript-style enum notation, which most modern LLMs recognize. Without enumerating the values, the model would occasionally invent levels: "moderate," "high-medium," "concerning." Enumeration prevents the invention.
The general rule: any field where you intend to write a downstream switch, filter, or join needs to be a closed enumeration in the prompt. Open-ended generation is what LLMs are good at, and exactly what you don't want for fields that will be programmatically consumed.
For production systems where this matters more than it does for a demo, three stronger alternatives exist: schema-constrained decoding (Bedrock and most providers enforce both syntax and schema at the decoder level), tool use / function calling (guaranteed valid JSON for the tool schema), and Pydantic + auto-retry on validation failure. Use them. The prompt-level technique above is the floor, not the ceiling.
Technique 3: Calibrated Few-Shot, Not Reflexive Few-Shot#
The default advice for any tricky prompt is "add a few-shot example." That advice is wrong about half the time. Few-shot examples are expensive (they consume input tokens), they're anchoring (the model will copy their style harder than you expect), and they're a maintenance burden (they have to evolve with the schema).
I now ask three questions before adding any few-shot example:
1. Will the model do the task correctly with zero examples? For the Contract Analyzer extraction call, the answer was yes — once the schema was specified rigorously. Claude Haiku 4.5 with temperature 0.1, the closed-vocabulary constraint, and the rubric got to 98% reliability with zero few-shot examples. Adding examples would have cost tokens without improving accuracy.
2. Will one example make the task much easier to understand? For tasks where the output shape is genuinely novel (custom DSLs, unusual rubrics, non-obvious decomposition), one example can be worth its weight in input tokens. Two examples are rarely necessary. Five examples almost always degrade performance — the model over-fits to the surface features of the examples and ignores the schema.
3. Are my few-shot examples calibrated against my actual failure modes? This is the one most teams skip. If the model is failing on edge cases — clauses with unusual phrasing, contracts with non-standard structure — your few-shot examples should be those edge cases, not the middle of the distribution. Adding a "happy path" example to fix an edge-case failure is the equivalent of studying the wrong section for the test.
For the extraction call, I tested both zero-shot and one-shot variants. The zero-shot version with the rigorous schema beat the one-shot version on accuracy by 1-2% across the test suite. The one-shot version was slower (more input tokens) and more brittle (the model would occasionally borrow phrasing from the example clause and inject it into unrelated outputs). I removed the example.
The lesson: few-shot is a tool, not a default. Reach for it when the schema can't carry the task alone — and skip it when the schema can.
Technique 4: Negative Space — What You Don't Want#
The most underrated prompt technique is naming the failure modes you've already seen, and instructing the model not to produce them. I call it negative space.
Two examples from the Contract Analyzer codebase.
The perspective rubric. Early prompts said "assess the risk level." The model would sometimes assess from the opposite side of the deal — a non-compete came back "low risk" because it is low risk to the employer. The fix was three words: "for the party receiving/signing this contract." That phrase doesn't describe what the model should do; it describes whose perspective the model should not take. Negative space.
IMPORTANT-caps reinforcement. The user prompt ends with IMPORTANT: Return ONLY the JSON array, no other text. Capitalizing IMPORTANT is silly. It's also empirically the easiest way to get the model to pay attention to a constraint you've already named in the system prompt. The "no other text" clause exists because in early tests, the model was prepending "Sure! Here's the analysis:" to roughly 5% of responses. That preamble isn't part of any rubric; it's a failure mode you suppress by naming.
The temperature tuning story from the scoring post is another negative-space lesson. The executive summary call originally ran at temperature 0.1, and every summary started identically: "This contract presents a medium risk profile..." Robotic and indistinguishable across contracts. I tried 0.7, and the model started hallucinating sections that didn't exist — a "Section 8.3" the contract didn't have, a "force majeure clause" that wasn't there, a "$500,000 liability cap" from no clause I'd ever passed it. 0.3 turned out to be the narrow band where the prose has voice but stays anchored to the structured input.
That's not really a prompt-text change — it's a decoder parameter — but the discovery came from noticing a negative-space failure: "the summaries all sound the same" and then "the summaries are inventing things." Naming the failure was what made the fix obvious.
Two operational rules:
- Keep a list of failure modes you've seen. When the prompt regresses, the first move is to scan that list — odds are good the regression is on a known axis.
- Resist the urge to write
don't hallucinate. That phrase is a wish, not a constraint. Models don't have a "hallucinate" parameter to turn off. Name the specific hallucination shape you don't want — "do not invent section numbers," "do not output dollar figures not present in the input." Specificity is what makes negative space work.
Same Query, Three Rewrites — Prompt Column (First Appearance)#
Here's the structure that recurs in Posts 3 and 4. Today we fill in only the prompt column.
SCENARIO: A lawyer uploads a 60-page MSA. Extract every clause against
the 41-type CUAD taxonomy, score risk 0-100, output JSON.
┌─────────────────────┬─────────────────────┬─────────────────────┐
│ PROMPT LAYER │ CONTEXT LAYER │ HARNESS LAYER │
├─────────────────────┼─────────────────────┼─────────────────────┤
│ System: precise │ (Post 3) │ (Post 4) │
│ legal AI, JSON │ │ │
│ only │ │ │
│ │ │ │
│ User prompt: │ │ │
│ - 4-level risk │ │ │
│ rubric with │ │ │
│ perspective │ │ │
│ - 41-clause │ │ │
│ taxonomy │ │ │
│ (closed vocab) │ │ │
│ - Schema: 6 fields, │ │ │
│ types & enums │ │ │
│ specified │ │ │
│ - IMPORTANT-caps │ │ │
│ reinforcement │ │ │
│ │ │ │
│ Temperature: 0.1 │ │ │
└─────────────────────┴─────────────────────┴─────────────────────┘
OUTCOME: 98% JSON-parseable extractions. Closed-vocabulary violations
near zero. Risk scores assess from the signing party's perspective.
Still fails on: long contracts (top-of-document attention degradation),
malformed runs (~2% conversational preamble), multi-clause reasoning
(no cross-clause inference).
The prompt column is full. The other two are empty — and the "Still fails on" line is the agenda for Posts 3 and 4. The prompt layer did everything it could. The remaining failures are not prompt problems.
What Prompts Can't Fix#
Three failure modes the Contract Analyzer extraction prompt hit, and which no amount of prompt rewriting could fix:
1. Clauses near the top of a long document marked found: false. The model had the text in its context window. It could see it. It just couldn't reliably attend to it once the document grew past ~30,000 tokens. No prompt fixes this; it's the lost-in-the-middle problem, and the fix is to curate the context window before the model ever sees it. That's Post 3.
2. ~2% of responses with conversational preamble or trailing chatter. "Sure! Here's the analysis:" + JSON + "Let me know if you need anything else." I rewrote the prompt four times to suppress it. Each rewrite ticked success rate up by ~0.3% and then plateaued. The actual fix was five lines of Python around the model call — a defensive parser that lives outside the prompt entirely. That's Post 4.
3. Multi-clause reasoning across the contract. A liability cap clause that only makes sense in relation to the indemnification clause. A termination-for-cause clause that depends on the notice-period clause. Single-shot extraction sees clauses in isolation; reasoning across clauses requires a loop, possibly sub-agents, possibly memory. None of that is a prompt change. That's also Post 4.
The throughline is the same as in Post 1: when you've gotten a prompt to 98% and the remaining 2% won't move, the 2% is almost certainly on a different layer. Stop tightening the prompt. Look up or look down.
A Note on Safety Prompts#
One application of prompt engineering deserves its own callout: safety. The Clinical Research Agent's safety layer (full post) is a worked example of how much prompt-layer work goes into not hallucinating medical advice.
Two patterns from that codebase that are worth lifting into any safety-critical prompt:
Severity tiering inside the instruction. The agent's safety prompt distinguishes high-severity claims (dose recommendations, contraindications, approval status — anything that could harm someone if wrong) from medium-severity claims (general study references). The prompt doesn't ask the model to "be careful"; it tells the model which categories of claim require explicit citation and which can pass with a hedge.
Explicit acceptable citation formats. The safety prompt names the citation patterns the system will recognize: PMIDs (PMID: 12345678), NCT IDs (NCT01234567), bracketed numerics ([1]). The model is given a closed vocabulary of citations the same way the contract analyzer is given a closed vocabulary of clause types. Same technique, different domain.
Safety prompts are where negative space becomes literal: every line in a safety prompt is naming a specific failure mode you don't want, with a specific cost if it happens. Hallucinating a clause name is annoying. Hallucinating a drug dose is dangerous. The discipline of writing the bad outcomes into the prompt — explicitly, by name — is what makes safety prompts feel different from ordinary prompts.
What's Next#
In Blog 3: Context Engineering — Curating What the Model Sees, the prompt stays exactly the same as it ended up in this post. We change the context window around it: complexity classification to right-size the input, key-sentence compression to cut search-result tokens, semantic memory to skip work we've already done, and dedup to prevent the same finding from being read twice. The cost drops by 34%. The accuracy goes up — because the model isn't drowning in noise.
The context layer is where the Contract Analyzer's "clauses near the top of long documents" bug actually gets fixed. The model never wanted that text buried in 60k tokens of boilerplate. We just hadn't given it a curated window yet.
This is post 2 of 5 in the Three Layers of AI Engineering Series. The full series covers prompt, context, and harness as three distinct engineering disciplines, with case studies drawn from five shipped projects.
Referenced projects:
- Contract Analyzer: github.com/MinhQuanBuiSco/contract-analyzer
- Clinical Research: github.com/MinhQuanBuiSco/clinical-research-agent
- Context Engine: github.com/MinhQuanBuiSco/context-engine
- Deep Research Agent: github.com/MinhQuanBuiSco/deep-research-agent
- Agent Observability: github.com/MinhQuanBuiSco/agent-observability