Back
AI Engineering··25 min

Three Layers of AI Engineering — Blog 3: Context Engineering — Curating What the Model Sees

Same prompt as Post 2. Different context window. 34% fewer tokens, faster answers, and — counterintuitively — better accuracy. Seven techniques (complexity classification, compression, memory taxonomy, dedup, truncation, prompt caching, compaction), plus hybrid retrieval + reranking and GraphRAG as escalation paths. Real numbers from the Context Engine, Contract Analyzer, and current 2026 long-context benchmarks.

Three Layers of AI Engineering — Blog 3: Context Engineering — Curating What the Model Sees#

In Post 2, I committed to the prompt layer. The Contract Analyzer extraction prompt was 200 lines of system-as-contract, schema-first JSON, closed-vocabulary constraint, and IMPORTANT-caps reinforcement. It got the system to 98% reliability and stopped there.

Now I leave the prompt exactly as it was, and change the context window around it. Same model. Same instruction. Different input. The cost drops by 34%. The accuracy goes up. The bug from Post 1 where clauses near the top of long documents were marked found: false simply disappears — not because we rewrote the prompt, but because we stopped giving the model 60,000 tokens of boilerplate to drown in.

That's the thesis of this post: the context window is its own engineering surface. The five techniques I'll show — complexity classification, key-sentence compression, semantic memory, deduplication, and truncation — are all about what the model sees before it starts thinking. None of them require a single extra LLM call.


The Three Layers of AI Engineering Series#

PartTitleFocus
1Why I Stopped Calling It Prompt EngineeringThe three-layer thesis, the running scenario, the decision question
2Prompt Engineering — What's Still in ScopeSystem-as-contract, schema-first, few-shot, negative space
3Context Engineering — Curating What the Model Sees (this post)Complexity, compression, dedup, memory, lost-in-the-middle
4Harness Engineering — The Loop Around the ModelTools, sub-agents, hooks, schedulers, parsers, memory, observability
5Picking the Right Layer: A Decision FrameworkDiagnostic flowchart, iteration cost per layer, case studies across five projects

What Is Context Engineering, Exactly?#

A working definition, narrow on purpose:

The craft of deciding what enters the model's context window before the LLM is invoked — and how that input is assembled, ordered, compressed, and pruned.

What's in scope:

  • Retrieval (RAG): which chunks, how many, from where; hybrid sparse+dense + reranking
  • Compression: summarizing or distilling retrieved content
  • Truncation: cutting input before attention degrades
  • Deduplication: ensuring the same content doesn't appear twice
  • Memory: deciding what to carry forward from prior turns (episodic, semantic, procedural)
  • Tool/agent context: which tool definitions, which past turns, which intermediate state — increasingly served by MCP so tool defs become a token-budget decision, not a code decision
  • Cacheable prefix design: structuring stable content first so prompt caching pays
  • Compaction: summarizing old turns to free space in a long-running agent session
  • Window ordering: where each piece lives within the context, not just what's in it
  • Multimodal inputs: images and other modalities consume tokens the same way text does

What's not in scope (and lives one layer up or down):

  • Writing the instruction itself → prompt layer (Post 2)
  • Choosing tools and dispatching sub-agents → harness layer (Post 4)
  • Managing max_tokens, retries, and output parsing → harness layer

The clearest way to feel the boundary: in Post 2, I never changed what data the model saw. I only changed what I told the model to do with it. In this post, I never change the instruction — I only change the data. Two independent surfaces.


Why Bloated Context Makes Models Worse, Not Just More Expensive#

The intuition many people have about long contexts is "more is more" — the model has more to work with, so it should answer better. The empirical reality is the opposite, and it has a name: lost in the middle.

Research has shown repeatedly that LLMs attend to the beginning and end of a long context disproportionately. Information buried in the middle is less likely to be retrieved by the model than the same information placed near either edge. This is true even within the model's stated context limit. A 200k-token window doesn't mean the model attends to all 200k tokens uniformly — it means the model can read all 200k, then forget most of the middle.

This shows up in two flavors in production:

Flavor 1: Buried answer. The model has the information it needs, but it's between tokens 30,000 and 60,000. The model misses it and answers "not found" or, worse, fabricates an answer. The Contract Analyzer "clauses near the top marked as missing" bug from Post 1 is this flavor — except inverted (the model was apparently dropping the beginning of long inputs in addition to the middle, suggesting attention degradation kicked in non-uniformly past my threshold).

Flavor 2: Drowned signal. The model has the information, attends to it, but the answer is averaged across noise. The relevant 200 tokens are surrounded by 5,000 tokens of related-but-not-quite-the-question content. The model produces a hedged, generic answer instead of the specific one.

Both flavors cost you twice: once in tokens (you're paying for content that's not helping) and once in quality (the model is performing worse than it would on a smaller, cleaner input). Context engineering is the discipline of paying both bills.

The 2026 numbers: advertised window ≠ effective window#

The lost-in-the-middle tax was easy to ignore when "long context" meant 32k tokens. It is much harder to ignore now that frontier models advertise 1M-2M token windows but still degrade hard inside them.

A snapshot from the current long-context benchmarks (MRCR v2 is multi-needle reasoning at 1M tokens; NIAH is single-needle "needle in a haystack"):

Model / windowSingle-needle (NIAH)Multi-needle (MRCR v2)
Claude Sonnet 4.5 / 200k~98%~92%
Claude Opus 4.6 / 1M89%76%
Gemini 3.1 Pro / 1M99%26.3%
GPT-5.5 / 1M96%(varies; below Claude on multi-hop)

Single-needle retrieval is mostly solved at 1M. Multi-needle reasoning — which is what most real production workloads actually need — collapses on every frontier model. Gemini in particular goes from "near perfect on one needle" to "barely a quarter" on multi-hop reasoning at the same window size. Advertised context size is not the number that matters; benchmarked retrieval accuracy at your actual window is.

The practical consequence: in 2026, the case for context engineering is stronger than it was in 2024, not weaker. The Contract Analyzer's 80k-character truncation rule — which looked aggressive when the model advertised a 200k window — looks prudent when you read the benchmark numbers and realize effective accuracy was already degrading well inside that limit.

Sources: Long-Context Benchmarks Leaderboard, Anthropic — Effective context engineering for AI agents.


Technique 1: Complexity Classification — Right-Size Before You Retrieve#

The first decision in any retrieval-backed system is "how much to fetch." Most systems pick a constant — 5 chunks, 10 chunks, top-k=20 — and apply it to every query. That's wrong, because "What is Python?" and "Compare the top 5 Python web frameworks for production API deployment in 2026" need radically different amounts of context.

The Context Engine handles this with a rule-based classifier. No LLM call. Runs in microseconds:

def classify_complexity(query: str) -> ComplexityResult:
    query_lower = query.lower().strip()
    word_count = len(query_lower.split())
    question_marks = query_lower.count("?")
    conjunctions = sum(query_lower.count(c) for c in [" and ", " or ", " but "])

    score = 0
    reasons = []

    if word_count < 10:
        score -= 1
    elif word_count > 30:
        score += 1

    if question_marks > 1:
        score += 1
    if conjunctions >= 2:
        score += 1

    if any(kw in query_lower for kw in _COMPARISON_KEYWORDS):
        score += 2  # "compare", "vs", "pros and cons"
    if any(kw in query_lower for kw in _ANALYSIS_KEYWORDS):
        score += 1
    if any(kw in query_lower for kw in _CURRENT_KEYWORDS):
        score += 1

    if score <= 0:
        return ComplexityResult(level=SIMPLE, sub_query_count=1)
    elif score <= 2:
        return ComplexityResult(level=MODERATE, sub_query_count=2)
    else:
        return ComplexityResult(level=COMPLEX, sub_query_count=4)

The full numerical story from that post:

QueryOld approachEngineeredReduction
"What is Python?"4 sub-queries × 5 results × ~500 tokens = 12,400 total1 sub-query × 3 results × ~100 tokens = 900 total93%
"Compare top 3 Python web frameworks"Full pipeline: 4 × 5 × 500 = 12,4004 sub-queries (kept), compressed results = ~7,400~40%

Two observations.

Why rule-based instead of LLM-based? A classifier built on word counts, keyword detection, and conjunction parsing runs in microseconds. An LLM classifier runs in 1-2 seconds and costs tokens. For a routing decision that just picks 1 / 2 / 4 sub-queries, the heuristic is good enough — and a 1-2 second routing delay would defeat the point of the optimization. This is the right place to push back on the AI-everywhere instinct.

Why even bother classifying — just always retrieve 4 sub-queries? Because the lost-in-the-middle tax is real. Pushing 12,400 tokens through the pipeline for "What is Python?" doesn't just cost more — it gives the model 12,000 tokens of noise to read past before getting to the actual answer. Smaller queries actually answer better with less context.

The general rule: any constant in your retrieval pipeline (top-k, chunk size, number of agents) should be variable, and the variable should depend on the input. The cheapest possible classifier — a regex match, a word count, a switch on intent — is usually enough to get the lift.


Technique 2: Key-Sentence Compression#

After you've decided how many results to fetch, the next question is what to do with them. The default is "include the full result content." That default is wasteful. A typical web search result is 500-2,000 tokens of HTML-stripped content, and the LLM cares about maybe 50-150 tokens of that.

The Context Engine's compressor extracts only the sentences most relevant to the query — using set intersection on keywords, again with zero LLM calls:

def _extract_key_sentences(content: str, query: str, max_sentences: int = 3) -> str:
    sentences = re.split(r"(?<=[.!?])\s+", content.strip())
    query_words = {w.lower() for w in re.findall(r"\w+", query)} - STOPWORDS

    scored = [
        (len(query_words & {w.lower() for w in re.findall(r"\w+", sent)}), sent)
        for sent in sentences
    ]
    scored.sort(key=lambda x: x[0], reverse=True)
    top = [sent for _, sent in scored[:max_sentences]]

    # Preserve original order
    return " ".join(s for s in sentences if s in top)

Compression rate on the Context Engine: 40-80% reduction in search-result tokens, depending on the query type. Factual queries compress harder (the answer is in 2-3 sentences); comparative queries compress less (more of the result is relevant).

Two design decisions worth pulling out:

Why keyword overlap instead of embedding similarity? Same answer as the complexity classifier: speed and simplicity. Embedding the query and each sentence costs 10-50ms per sentence and requires a model. Keyword overlap is microseconds and requires nothing. For sentence-level relevance scoring within a single document, the keyword approach captures most of the signal. The 5-10% accuracy left on the table is not worth the latency hit.

Why preserve original sentence order? Models read context sequentially, and the order matters. Pulling out three high-relevance sentences and reordering them by relevance score produces grammatically weird inputs (since the third sentence might depend on context from the first). The compressor scores by relevance but emits in source order. Read order is a hidden axis of context engineering — and a recurring one. The same principle (preserve causal/narrative order; bias high-importance content toward the end of the retrieval block, where the model attends best; keep instruction-last so the model has all the evidence before reading the question) shows up at the window level too.

When keyword overlap stops being enough: hybrid retrieval + reranking#

The Context Engine's keyword-overlap approach works for query → past-finding matching where domain vocabulary repeats reliably. It does not work for paraphrase, conceptual queries, or any retrieval where the surface words diverge from the meaning. For those, the 2026 production default is hybrid retrieval + reranking:

  1. BM25 (sparse, lexical) plus dense embedding retrieval, fused with Reciprocal Rank Fusion.
  2. A cross-encoder reranker (Cohere Rerank 4 Pro, Voyage rerank-2.5, Zerank-2) re-scores the top-k results before they enter the prompt.

The lift is real and reproducible: adding Cohere Rerank v4.0 Pro to hybrid retrieval gives +17.2pp MRR@3 and +12.1pp Recall@5 over unreranked hybrid. The cost is 50-200ms of additional latency per query and a few hundredths of a cent per rerank call. For any retrieval problem where the user query and the documents don't share vocabulary, hybrid+rerank is now the floor of acceptable RAG — and "pure vector search" is the warning sign of a 2023-vintage system.

The Contract Analyzer doesn't need this because its retrieval is the contract text itself (no cross-document matching). The Context Engine doesn't need it because its memory-cache retrieval is a routing decision, not a quality-critical retrieval. But any system where the user asks "find docs about X" and X paraphrases the document content needs the reranker layer.

Sources: Reranking with Cohere/Voyage/Zerank guide, Hybrid Search reference 2026.


Technique 3: Memory as Context Curation — Episodic, Semantic, Procedural#

The third technique answers a question most retrieval pipelines never ask: "have I seen this query before?" — but it's the narrow slice of a larger question that the 2026 industry has converged on a vocabulary for.

The current taxonomy (borrowed from cognitive science by Mem0, LangMem, Letta, Zep, and broadly adopted) splits agent memory into three kinds:

  • Episodic — specific past interactions ("on March 4, the user asked about Pydantic v2 migration; here is what happened")
  • Semantic — durable facts, preferences, entities ("the user prefers FastAPI"; "this customer's account ID is 7821")
  • Procedural — learned workflows and skills ("the deploy process for this repo is: lint → test → build → push")

What follows in this section is semantic memory of research findings — durable facts about what was previously discovered, keyed by the query that surfaced them. Episodic memory (per-turn history) and procedural memory (learned workflows) are separate concerns, often persisted in separate backends. Calling all three "memory" without distinguishing them is the source of most "the agent doesn't remember things" debugging confusion.

The boundary with the harness layer: each memory backend lives in the harness (the storage, retrieval, and write policy is harness code). But deciding what to inject into the prompt is a context-engineering decision. You retrieve from semantic memory; you choose which two facts make it into the next call. That choice is what this section is about.

The Context Engine's semantic memory is a JSON-on-disk cache that keys research findings by query keywords. On a cache hit, the entire research pipeline is skipped — no web search, no LLM call, no orchestration. Just return the cached findings.

The matching uses Jaccard similarity on keyword sets:

def _jaccard_similarity(self, set_a: set[str], set_b: set[str]) -> float:
    if not set_a or not set_b:
        return 0.0
    intersection = len(set_a & set_b)
    union = len(set_a | set_b)
    return intersection / union if union > 0 else 0.0

Two design decisions from the memory post that are load-bearing:

Jaccard, not embeddings. Quoting the source: "Embedding a query takes 10–50ms (API call or local model inference). Jaccard similarity on keyword sets takes <0.1ms." For research queries — which tend to use explicit domain terminology — keyword overlap captures most of the semantic overlap. People searching for "Kubernetes deployment strategies" and "Kubernetes scaling best practices" share the keyword "kubernetes," and that's usually enough.

Threshold 0.5 — conservative on purpose. Jaccard ≥ 0.5 means at least half the keywords overlap. The post explains the asymmetry: "false negatives (missing a cache hit) cost a web search, but false positives (injecting irrelevant cached context) can degrade the final report quality." A bad cache hit is worse than a missed cache hit. The threshold is tuned for that asymmetry.

This is a key conceptual move: memory is context engineering, not magic. It's a curation decision about what enters the context window — namely, "this prior research finding, instead of the live web." The infrastructure is a 200-line Python module with a JSON file. The lift is enormous because the alternative is paying for another full research pipeline.


Technique 4: Deduplication#

In a multi-source retrieval pipeline (Context Engine runs four parallel researchers), the same finding will frequently appear in multiple sources. If you naively concatenate all researcher outputs, the model sees the same fact three times — paying tokens to read each repetition and weighting that fact disproportionately in its synthesis.

The fix is paragraph-level content hashing. Each researcher emits findings; the deduplicator normalizes (lowercase, strip whitespace, drop punctuation), hashes paragraphs, and keeps only the first occurrence. Full mechanics in the dedup post.

Numbers from production: dedup catches 15-30% of paragraphs in a typical complex query, with no quality loss. The remaining unique content is what gets passed to synthesis.

The architectural point: any pipeline that fans out (parallel agents, multi-source RAG, multi-tool calls) needs an explicit dedup step on the return path. Otherwise you're paying twice for the same context, and the model is weighting repeated content as more important than unique content.


Technique 5: Truncation — Cutting Before Attention Degrades#

This is the technique that actually fixes the Contract Analyzer "clauses near the top marked as missing" bug. The fix in production is one line:

max_chars = 80_000
if len(contract_text) > max_chars:
    contract_text = contract_text[:max_chars] + "\n\n[... contract text truncated ...]"
    logger.warning("Contract text truncated to %d characters", max_chars)

Three things about this fix.

It looks brutal. Cut at 80,000 characters — about page 30 of a typical contract — and append a "truncated" marker. That's it. No retrieval, no summarization, no scoring. Just chop.

It works because of the domain. Commercial contracts are front-loaded: the operative clauses (termination, liability, IP, payment) appear in the first half. Past page 30, you're in definitions, schedules, and boilerplate. The cost of dropping that content is near zero for the extraction task. Truncation is only viable when you understand what's near the cut.

It would be wrong for a different domain. Truncating a research paper at 80,000 characters would drop the methodology and conclusions. Truncating a legal brief might drop the binding precedent. The general rule is: truncation is context engineering at its bluntest, and it requires knowing what your input's information density curve looks like. Always log when truncation fires — that warning is your only visibility into how often you're cutting real signal.

A more refined variant exists: hierarchical summarization (summarize the back half of the document, keep the front half verbatim). For Contract Analyzer's risk-extraction use case, blunt truncation outperformed it because the summary added tokens without adding information the rubric cared about. Pick the technique that matches the task; don't reach for sophistication when bluntness works.


Technique 6: Stable-Prefix Architecture for Prompt Caching#

Every technique in this post so far has been about reducing the number of tokens in the context window. Prompt caching changes the optimization target: in 2026, the dominant question isn't "how few tokens?" — it's "how many of these tokens can be cached, and is the cacheable part at the front of the prompt?"

The economics are dramatic. On Anthropic's API today:

  • Cache writes cost 1.25x base input tokens (5-min TTL) or 2x (1-hour TTL)
  • Cache reads cost 0.1x base input — a 90% discount on cached content

That changes the math completely. A 60,000-token prompt where 55,000 tokens are stable (system prompt, tool definitions, taxonomy, background documents) and 5,000 tokens are dynamic (the user's actual contract) is cheaper per call than a tightly-engineered 20,000-token prompt that's 100% dynamic — once the cache is warm.

The shape of the discipline shifts accordingly:

1. Order your prompt by stability. Anthropic processes the prompt as Tools → System → Messages. Whatever's first must be the most stable. Any change to a higher-level block (a new tool, a system-prompt edit) invalidates every cache breakpoint downstream. The Contract Analyzer's CUAD taxonomy (the 41 clause types, formatted as a list) is the perfect cache candidate: ~3,000 tokens of identical content on every call, placed before the variable contract text.

2. Place explicit cache_control breakpoints. Anthropic's API takes a cache_control: {"type": "ephemeral"} marker at the breakpoint. Up to 4 breakpoints per request, each at a min-cacheable boundary (1,024 tokens for Sonnet/Opus, 2,048 for Haiku). A typical layout: one breakpoint after the system prompt + tool defs, one after the large background document, one before the latest user turn.

3. Mind the TTL. Anthropic dropped the default cache TTL from 60 minutes to 5 minutes in early 2026 — a quiet change that broke many production systems' implicit assumption that the cache would still be warm on the next call. Set ttl: "1h" explicitly if your traffic pattern is bursty or unpredictable; pay the 2x write cost in exchange for surviving idle periods.

4. The cache-hit hierarchy is brittle. Adding a single tool, reordering a few examples, or changing a single word in the system prompt invalidates the cache for the entire downstream prefix. Cache hits drop from 70%+ to 7% on any incautious edit. Treat the cached prefix like a stable ABI — version it, test cache behavior in CI, log cache hit/miss per request.

Real-world numbers from production deployments: 47-80% spend reduction on agent workloads where the system prompt and tool definitions are stable. ProjectDiscovery cut their LLM bill 59% by moving dynamic content after the cache breakpoint (a one-line code change). That kind of return is why caching deserves a section of its own rather than a footnote.

The same principle now exists across providers (OpenAI's cache_control equivalent, Google's implicit caching) and at the inference layer (vLLM, SGLang, TensorRT-LLM all do prefix-cache reuse via PagedAttention). API-level prompt caching is the user-facing exposure of an inference-server optimization that has been silently improving long-prompt economics for years.

Sources: Anthropic — Prompt caching, Mager — Prompt caching deep dive.


Technique 7: Context Compaction — Compression Across Time#

The six techniques above are spatial — they curate what enters the window within a single LLM call. Compaction is the temporal analogue: as an agent runs for many turns, the conversation history grows past what the window can hold, and old turns have to be summarized to free space.

This is the technique that every shipped agent system in 2026 uses and that no 2024-era context-engineering article covered.

The Claude Code pattern (now standard across Cursor, Codex CLI, OpenCode, and any serious agentic tool):

  • The harness watches token usage during the agent loop.
  • When usage crosses a threshold (~95% of window in Claude Code's case), an "auto-compact" routine triggers.
  • The routine summarizes everything older than the most recent N turns into a structured note that replaces the original content.
  • The agent continues with a compacted history that preserves decisions, tool results, and intermediate state — but at a fraction of the token cost.

Anthropic added this as native API surface area in late 2025 via the compaction_control parameter, so client code doesn't have to roll its own. The agent never knows compaction happened; it sees a continuous conversation.

Two design decisions worth knowing.

The compaction itself is a context-engineering decision, not a prompt one. The summarizer (often a separate LLM call, sometimes the same model in a different mode) decides what stays and what goes. Bad summarization → lost state → broken agent. Good summarization preserves decisions, tool outputs, file paths, error messages, and the user's stated goal; it drops verbose prose and redundant reasoning.

State should live outside the window when it can. The companion pattern: have the agent write a structured progress file (claude-progress.txt, notes/state.md, or a dedicated scratchpad tool) as it works. When compaction happens, the file survives even if the in-context summary loses fidelity. The agent re-reads the file on its next turn. This is the durable analogue of the in-memory compaction.

For the Contract Analyzer's single-shot extraction, compaction is irrelevant — there's only one turn. For a multi-turn legal assistant that helps a lawyer iterate on a contract over an hour-long session ("now compare clause 14 to the indemnification clause"; "now suggest revisions"; "now draft a counter-clause"), compaction is what keeps the agent functional past the 10-turn mark.

Sources: Anthropic — Compaction docs, Codex CLI / Claude Code compaction deep dive.


Same Query, Three Rewrites — Prompt + Context Columns#

Two columns full. Notice the prompt is unchanged from Post 2.

SCENARIO: A lawyer uploads a 60-page MSA. Extract every clause against
the 41-type CUAD taxonomy, score risk 0-100, output JSON.

┌─────────────────────┬─────────────────────────┬─────────────────┐
│   PROMPT LAYER      │   CONTEXT LAYER         │   HARNESS LAYER │
├─────────────────────┼─────────────────────────┼─────────────────┤
│ System: precise     │ Input pipeline:         │  (Post 4)       │
│   legal AI, JSON    │ - PDF → text via        │                 │
│   only              │   pdfplumber            │                 │
│                     │ - Truncate at 80k chars │                 │
│ User prompt:        │   (front-loading        │                 │
│ - 4-level risk      │    assumption)          │                 │
│   rubric with       │ - Format 41 CUAD clause │                 │
│   perspective       │   types as              │                 │
│ - 41-clause         │   "[Category] Name"     │                 │
│   taxonomy          │   list                  │                 │
│   (closed vocab)    │                         │                 │
│ - Schema: 6 fields, │ Window ordering:        │                 │
│   types & enums     │ - System prompt →       │                 │
│   specified         │   schema → taxonomy →   │                 │
│ - IMPORTANT-caps    │   contract → IMPORTANT  │                 │
│   reinforcement     │   (stable prefix first, │                 │
│                     │    variable input last) │                 │
│ Temperature: 0.1    │                         │                 │
│                     │ Caching (2026 add-on):  │                 │
│                     │ - cache_control after   │                 │
│                     │   taxonomy block        │                 │
│                     │ - Cache reads = 0.1x    │                 │
│                     │   base input cost       │                 │
│                     │ - Hit rate on batch     │                 │
│                     │   processing: ~85%      │                 │
│                     │                         │                 │
│                     │ Memory (semantic):      │                 │
│                     │ - Cache prior extracted │                 │
│                     │   contracts keyed by    │                 │
│                     │   doc hash (skip work)  │                 │
│                     │                         │                 │
│                     │ Compaction:             │                 │
│                     │ - N/A for single-shot;  │                 │
│                     │   would apply to multi- │                 │
│                     │   turn legal assistant  │                 │
│                     │                         │                 │
│                     │ Total context:          │                 │
│                     │  ~20,000 tokens         │                 │
│                     │  (was ~60,000)          │                 │
│                     │  Effective cost on warm │                 │
│                     │  cache: ~3,000 tokens   │                 │
│                     │  billed                 │                 │
└─────────────────────┴─────────────────────────┴─────────────────┘

OUTCOME: 98% JSON-parseable (unchanged). Long-document accuracy
restored — no more top-of-document "not found" failures. Cost down
~60% on contracts > 30 pages from truncation alone; another ~80%
on top of that for repeat or batch processing once caching is wired
in. Still fails on: ~2% conversational preamble, multi-clause
reasoning (cross-clause inference still impossible without an agent
loop).

The lift from the context layer is enormous, and it didn't touch the prompt. That's the point. Each layer has its own iteration surface, and context engineering is where the lost-in-the-middle bug actually lives.

The Context Engine series (six posts) is the full reference for a research-pipeline application of these techniques. For an extraction pipeline like Contract Analyzer, the relevant subset is truncation + caching + (for batch processing) memory. For a multi-agent pipeline like Deep Research, all seven techniques apply, with compaction as the load-bearing one for long sessions.


What Context Engineering Can't Fix#

Two failures the context layer can't reach, both setting up Post 4.

1. The 2% conversational preamble. Same as before. No amount of input curation prevents the model from prepending "Sure! Here's the analysis:" on bad days. The fix is a five-line defensive parser around the model call — a harness concern.

2. Multi-clause reasoning. A user asks: "Does the liability cap protect against the indemnification obligations?" That's a question about the relationship between two clauses. Single-shot extraction sees clauses in isolation. Even a perfectly curated context window doesn't make the model loop back and cross-reference. You need an agent loop, possibly with sub-agents fanning out to related clauses, possibly with memory of what's already been found. That's the harness layer.

A partial context-layer answer exists: structured retrieval over an extracted knowledge graph (Microsoft's GraphRAG, or one of the lighter-weight 2026 variants like LightRAG and LazyGraphRAG). The idea is to extract entities and relationships from the document at indexing time, then retrieve via graph traversal at query time — letting "liability cap" and "indemnification" pull each other in via a typed references edge. GraphRAG reports 87% accuracy on multi-hop reasoning vs. 23% for traditional RAG on the same questions, at the cost of higher upfront indexing complexity. For Contract Analyzer's single-document extraction it's overkill; for a contract-portfolio search system across thousands of MSAs it's the natural escalation path. The boundary holds, though: even GraphRAG is better context, not an agent loop — and questions that require multi-step reasoning still want the harness layer above it.

A more subtle harness failure that context engineering surfaces: when the Context Engine started skipping the research pipeline on cache hits, total response time dropped from 12s to 200ms. The frontend wasn't ready for that — the streaming progress UI assumed long latency and showed an awkward instant flash. The fix was a harness change (different streaming behavior for cache-hit vs. cache-miss responses), not a context change. Context layer optimizations often reveal harness-layer assumptions.


What's Next#

In Blog 4: Harness Engineering — The Loop Around the Model, I'll wrap the agent in everything that lives outside the model call: the sub-agent decomposition for cross-clause reasoning, the defensive parser for malformed runs, the hook-enforced stop condition for runaway loops, the persistent memory that makes the agent feel less amnesiac. The Same Query box gets its third column. By the end of that post, you've seen the full stack: prompt + context + harness — same input, three layers of engineering, one cohesive system.

Then in Blog 5, we put it all together: how to look at a misbehaving agent and know within thirty seconds which of the three layers to reach for.


This is post 3 of 5 in the Three Layers of AI Engineering Series. The full series covers prompt, context, and harness as three distinct engineering disciplines, with case studies drawn from five shipped projects.

Referenced projects: