Context Engine Series - Blog 2: Query Complexity & Result Compression

Not every question needs 4 parallel agents and 20 search results. A simple "What is Python?" needs one search. A comparative analysis needs four. But the original Deep Research Agent treated every query identically: 4 sub-queries, 5 search results each, full content dumped into the context window. That's like hiring a team of 4 investigators to find out what time a store closes.

In this post, we build two techniques that together deliver the largest token savings in the entire Context Engine: a complexity classifier that adapts the number of sub-queries, and a result compressor that extracts only the sentences the LLM actually needs from each search result.

The Context Engine Series

Part	Title	Focus
1	Architecture & Vision	System design, 6 techniques, pipeline overview
2	Query Complexity & Result Compression (this post)	Rule-based classifier, key sentence extraction
3	Semantic Memory & Research Caching	Jaccard similarity, local JSON cache
4	Dynamic Tool Selection	Intent classification, tool loadout
5	Source & Findings Deduplication	URL dedup, paragraph-level content hashing
6	Context X-ray Visualization	Real-time token tracking, WebSocket events

The Problem: One Size Fits None

In the original Deep Research Agent, the decomposition prompt always asked for exactly 4 sub-queries:

decompose_prompt = (
    f"Decompose this research question into 4 specific, non-overlapping "
    f"search queries that together cover all aspects..."
)

That hardcoded 4 is the root of the waste. Consider three real queries:

Query	Ideal Sub-queries	Original Pipeline
"What is Python?"	1	4 (3 wasted)
"Explain React hooks and state management"	2	4 (2 wasted)
"Compare LangChain vs CrewAI vs Strands for production AI agents"	4	4 (correct)

For simple queries, we're doing 4x the work needed. Each unnecessary sub-query means an extra LLM call (agent creation + reasoning), extra Tavily API calls, and extra tokens of findings stuffed into the synthesis prompt. The fix is straightforward: classify the query's complexity before decomposition, and adjust the sub-query count accordingly.

The Complexity Classifier

The classifier lives in backend/context/complexity.py. It's entirely rule-based — no LLM call, no embedding model, no API request. Just keyword matching and counting. Here's why: any LLM call to classify complexity would cost tokens, which defeats the purpose. The heuristic runs in microseconds and is right ~90% of the time — good enough.

The Data Model

from enum import Enum
from dataclasses import dataclass


class QueryComplexity(Enum):
    SIMPLE = "simple"
    MODERATE = "moderate"
    COMPLEX = "complex"


@dataclass
class ComplexityResult:
    level: QueryComplexity
    sub_query_count: int
    reasoning: str

Three levels, each mapped to a sub-query count: SIMPLE gets 1, MODERATE gets 2, COMPLEX gets 4. The reasoning field stores a human-readable explanation that appears in the Context X-ray panel.

Keyword Sets

The classifier looks for three categories of keywords that signal complexity:

_COMPARISON_KEYWORDS = {
    "compare", "vs", "versus", "difference between", "differences",
    "pros and cons", "advantages and disadvantages", "trade-offs",
}
_ANALYSIS_KEYWORDS = {
    "analyze", "analysis", "investigate", "research", "comprehensive",
    "deep dive", "survey", "state of", "landscape", "evaluate",
    "in detail", "explain in detail", "thorough",
}
_CURRENT_KEYWORDS = {
    "latest", "current", "recent", "2025", "2026", "trending", "trends",
    "what are the latest", "current state", "new developments",
}

Comparison keywords are the strongest signal — if someone says "compare X vs Y," they almost certainly need multiple sub-queries to cover each item being compared. Analysis keywords suggest depth. Current event keywords suggest the need for web search (the LLM's training data may be stale).

The Scoring Algorithm

def classify_complexity(query: str) -> ComplexityResult:
    query_lower = query.lower().strip()
    word_count = len(query_lower.split())
    question_marks = query_lower.count("?")
    conjunctions = sum(query_lower.count(c) for c in [" and ", " or ", " but "])

    score = 0
    reasons = []

    # Length signals
    if word_count < 10:
        score -= 1
        reasons.append("very short query")
    elif word_count > 30:
        score += 1
        reasons.append("long detailed query")

    # Multiple questions
    if question_marks > 1:
        score += 1
        reasons.append(f"{question_marks} questions detected")

    # Conjunctions suggest multi-aspect
    if conjunctions >= 2:
        score += 1
        reasons.append(f"{conjunctions} conjunctions (multi-aspect)")

    # Comparison keywords
    if any(kw in query_lower for kw in _COMPARISON_KEYWORDS):
        score += 2
        reasons.append("comparison keywords detected")

    # Analysis keywords
    if any(kw in query_lower for kw in _ANALYSIS_KEYWORDS):
        score += 1
        reasons.append("analysis keywords detected")

    # Current events keywords
    if any(kw in query_lower for kw in _CURRENT_KEYWORDS):
        score += 1
        reasons.append("current events keywords detected")

    # Classify based on score
    if score <= 0:
        level = QueryComplexity.SIMPLE
        sub_query_count = 1
    elif score <= 2:
        level = QueryComplexity.MODERATE
        sub_query_count = 2
    else:
        level = QueryComplexity.COMPLEX
        sub_query_count = 4

    reasoning = (f"Score {score}: {', '.join(reasons)}"
                 if reasons else f"Score {score}: default classification")

    return ComplexityResult(
        level=level,
        sub_query_count=sub_query_count,
        reasoning=reasoning,
    )

The scoring system is additive. Each signal contributes to the score, and the final classification uses simple thresholds:

Score <=0  →  SIMPLE    →  1 sub-query
Score 1-2  →  MODERATE  →  2 sub-queries
Score >2   →  COMPLEX   →  4 sub-queries

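Notice that comparison keywords add +2 (not +1). On their own they are enough to lift a query out of SIMPLE: a short query with a comparison keyword like "React vs Vue" gets score = -1 + 2 = 1 (MODERATE, 2 sub-queries), which feels right. They are not enough to reach COMPLEX by themselves, though. COMPLEX requires a score of 3 or more, so a comparison needs at least one additional signal, such as an analysis or current-events keyword.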
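To make the thresholds concrete, the sketch below restates the scoring rules in condensed form and traces four queries through them. `score_query` and the uppercase set names are shorthand for this example, not project identifiers, and the last query is made up to show what it takes to cross into COMPLEX.

```python
# Condensed restatement of the scoring rules above, for tracing examples only.
COMPARISON = {"compare", "vs", "versus", "difference between", "differences",
              "pros and cons", "advantages and disadvantages", "trade-offs"}
ANALYSIS = {"analyze", "analysis", "investigate", "research", "comprehensive",
            "deep dive", "survey", "state of", "landscape", "evaluate",
            "in detail", "explain in detail", "thorough"}
CURRENT = {"latest", "current", "recent", "2025", "2026", "trending", "trends",
           "what are the latest", "current state", "new developments"}


def score_query(query: str) -> tuple[int, str]:
    q = query.lower().strip()
    words = len(q.split())
    score = 0
    if words < 10:
        score -= 1
    elif words > 30:
        score += 1
    if q.count("?") > 1:
        score += 1
    if sum(q.count(c) for c in (" and ", " or ", " but ")) >= 2:
        score += 1
    if any(kw in q for kw in COMPARISON):
        score += 2
    if any(kw in q for kw in ANALYSIS):
        score += 1
    if any(kw in q for kw in CURRENT):
        score += 1
    level = "simple" if score <= 0 else "moderate" if score <= 2 else "complex"
    return score, level


examples = [
    "What is Python?",
    "React vs Vue",
    "Compare LangChain vs CrewAI vs Strands for production AI agents",
    # Hypothetical query: same comparison plus a current-events keyword.
    "Compare the latest trade-offs of LangChain vs CrewAI vs Strands "
    "for production AI agents",
]
for q in examples:
    print(score_query(q), q)
# (-1, 'simple')  (1, 'moderate')  (2, 'moderate')  (3, 'complex')
```

With these rules a comparison keyword alone tops out at a score of 2 (MODERATE). The fourth query reaches 3 only because "latest" adds a current-events point.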

How It Integrates

In the orchestrator, the hardcoded 4 is replaced by the classifier's output:

# Before (Deep Research Agent):
decompose_prompt = (
    f"Decompose this research question into 4 specific..."
)

# After (Context Engine):
sub_query_count = context_engine.complexity.sub_query_count
decompose_prompt = (
    f"Decompose this research question into {sub_query_count} specific, "
    f"non-overlapping search queries..."
)

It's a small change, but the impact cascades: fewer sub-queries mean fewer researcher agents, fewer Tavily calls, fewer findings to synthesize, and fewer tokens across the entire pipeline.

Result Compression

Even after reducing the number of sub-queries, each researcher still gets raw Tavily search results — 5 results per search, each with 500+ tokens of content. Most of that content is irrelevant. A search for "Python web frameworks 2026" returns articles where only 2-3 sentences actually address the query. The rest is navigation text, author bios, and tangential paragraphs.

The compressor in backend/context/compressor.py applies three techniques:

Technique 1: Limit Result Count

_LIMITS = {
    QueryComplexity.SIMPLE:   {"max_results": 3, "max_content_chars": 300},
    QueryComplexity.MODERATE: {"max_results": 4, "max_content_chars": 400},
    QueryComplexity.COMPLEX:  {"max_results": 5, "max_content_chars": 500},
}

A SIMPLE query gets at most 3 results. COMPLEX gets 5. This is the bluntest optimization but also the safest — the results are already ranked by Tavily's relevance scoring, so dropping the bottom 2 rarely loses important information.

Technique 2: Key Sentence Extraction

This is where the real compression happens. Instead of passing the full content of each result, we extract only the sentences most relevant to the query:

import re


def _extract_key_sentences(content: str, query: str, max_sentences: int = 3) -> str:
    if not content:
        return ""

    # Split into sentences
    sentences = re.split(r"(?<=[.!?])\s+", content.strip())
    if not sentences:
        return ""

    # Normalize query keywords
    stopwords = {"the", "a", "an", "is", "are", "was", "were", "in", "on",
                 "at", "to", "for", "of", "and", "or", "but", "with",
                 "what", "how", "why", "when", "where", "which", "who"}
    query_words = {w.lower() for w in re.findall(r"\w+", query)} - stopwords

    if not query_words:
        return " ".join(sentences[:max_sentences])

    # Score each sentence by keyword overlap, remembering its position so
    # that repeated sentences cannot inflate the output past max_sentences
    scored = []
    for idx, sent in enumerate(sentences):
        sent_words = {w.lower() for w in re.findall(r"\w+", sent)}
        overlap = len(query_words & sent_words)
        scored.append((overlap, idx))

    # Sort by relevance (stable sort, so ties keep document order), take top N
    scored.sort(key=lambda x: x[0], reverse=True)
    top_indices = sorted(idx for _, idx in scored[:max_sentences])

    # Return in original order (preserve reading flow)
    return " ".join(sentences[i] for i in top_indices)

The algorithm is simple: split content into sentences, score each by keyword overlap with the query (ignoring stopwords), take the top 3, and return them in their original order to preserve reading flow. No TF-IDF, no embeddings, no LLM — just set intersection. It works because search results are already topically relevant (Tavily pre-filtered them), so keyword overlap is a strong enough signal for sentence-level extraction.
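To see the extraction on a concrete input, here is a condensed sketch of the same idea applied to made-up page text. The function name `extract_key_sentences`, the sample content, and the query are all illustrative. This variant tracks sentence positions instead of comparing strings.

```python
import re

STOPWORDS = {"the", "a", "an", "is", "are", "was", "were", "in", "on",
             "at", "to", "for", "of", "and", "or", "but", "with",
             "what", "how", "why", "when", "where", "which", "who"}


def extract_key_sentences(content: str, query: str, max_sentences: int = 3) -> str:
    """Condensed variant of the extraction logic above, for demonstration."""
    sentences = re.split(r"(?<=[.!?])\s+", content.strip())
    query_words = {w.lower() for w in re.findall(r"\w+", query)} - STOPWORDS
    scored = []
    for idx, sent in enumerate(sentences):
        sent_words = {w.lower() for w in re.findall(r"\w+", sent)}
        scored.append((len(query_words & sent_words), idx))
    # Stable sort: sentences with equal overlap keep document order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    keep = sorted(idx for _, idx in scored[:max_sentences])
    return " ".join(sentences[i] for i in keep)


# Made-up page text, standing in for one search result's "content" field.
content = (
    "Subscribe to our newsletter for weekly updates. "
    "Django is a batteries-included Python web framework. "
    "The author lives in Berlin with two cats. "
    "FastAPI is a Python framework for building web APIs quickly. "
    "Click here to read more posts."
)
summary = extract_key_sentences(content, "Python web frameworks", max_sentences=2)
print(summary)
# Django is a batteries-included Python web framework. FastAPI is a Python
# framework for building web APIs quickly.
```

Both kept sentences match on "python" and "web" only. The query word "frameworks" does not match "framework" because the comparison is exact, with no stemming. That is one of the trade-offs of plain set intersection.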

Technique 3: Character Truncation

After key sentence extraction, the content is truncated to the complexity-dependent character limit:

if len(key_content) > max_content_chars:
    key_content = key_content[:max_content_chars].rsplit(" ", 1)[0] + "..."

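The rsplit(" ", 1)[0] drops whatever follows the last space in the sliced text, so the output does not end in a partial word as long as the slice contains a space. There are two edge cases. If the limit falls exactly at the end of a word, that whole word is still dropped. And because "..." is appended after the cut, the final string can run a couple of characters past max_content_chars.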
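A quick check of the truncation expression on a few inputs, wrapped in a hypothetical helper `truncate_at_word` (the name and the sample strings are assumptions for this sketch):

```python
def truncate_at_word(text: str, max_chars: int) -> str:
    # Same expression as in the snippet above, wrapped for easy testing.
    if len(text) > max_chars:
        text = text[:max_chars].rsplit(" ", 1)[0] + "..."
    return text


sample = "Kubernetes orchestrates containers"

# Limit lands mid-word: the partial word "orchestra" is dropped.
print(truncate_at_word(sample, 20))                  # Kubernetes...

# Limit lands exactly at a word end: the complete word is dropped too.
print(truncate_at_word(sample, 23))                  # Kubernetes...

# No space inside the slice: the word is cut anyway.
print(truncate_at_word("Supercalifragilistic", 10))  # Supercalif...

# Text already within the limit is returned unchanged.
print(truncate_at_word("short text", 50))            # short text
```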

The Full Compression Pipeline

def compress_tavily_results(
    results: list[dict],
    query: str,
    complexity: QueryComplexity = QueryComplexity.COMPLEX,
) -> CompressionResult:
    limits = _LIMITS[complexity]
    max_results = limits["max_results"]
    max_content_chars = limits["max_content_chars"]

    original_tokens = count_tokens(str(results))

    # Filter out low-relevance results (score < 0.3)
    filtered = [r for r in results if r.get("score", 1.0) >= 0.3]
    if not filtered:
        filtered = results

    # Limit count
    limited = filtered[:max_results]

    # Compress content
    compressed = []
    for r in limited:
        content = r.get("content", "")
        key_content = _extract_key_sentences(content, query, max_sentences=3)
        if len(key_content) > max_content_chars:
            key_content = key_content[:max_content_chars].rsplit(" ", 1)[0] + "..."

        compressed.append({
            "title": r.get("title", ""),
            "url": r.get("url", ""),
            "content": key_content,
            "score": r.get("score", 0),
        })

    compressed_tokens = count_tokens(str(compressed))

    return CompressionResult(
        results=compressed,
        original_count=len(results),
        compressed_count=len(compressed),
        original_tokens=original_tokens,
        compressed_tokens=compressed_tokens,
    )

The CompressionResult tracks both original and compressed token counts, so the X-ray panel can display exact savings percentages.

Tracking Savings

@dataclass
class CompressionResult:
    results: list[dict]
    original_count: int
    compressed_count: int
    original_tokens: int
    compressed_tokens: int

    @property
    def savings_percent(self) -> float:
        if self.original_tokens == 0:
            return 0.0
        return round((1 - self.compressed_tokens / self.original_tokens) * 100, 1)
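As a rough illustration of how the savings figure behaves, the sketch below repeats the dataclass so it runs on its own and feeds it made-up token counts. The `count_tokens` shown here is an assumption: the project's real implementation is not shown in this post, and a 4-characters-per-token estimate is only a common rule of thumb.

```python
from dataclasses import dataclass


@dataclass
class CompressionResult:
    results: list[dict]
    original_count: int
    compressed_count: int
    original_tokens: int
    compressed_tokens: int

    @property
    def savings_percent(self) -> float:
        if self.original_tokens == 0:
            return 0.0
        return round((1 - self.compressed_tokens / self.original_tokens) * 100, 1)


def count_tokens(text: str) -> int:
    # ASSUMPTION: rough estimate of 4 characters per token. The real
    # count_tokens may use a proper tokenizer instead.
    return len(text) // 4


# Illustrative numbers only: 5 results of ~500 tokens down to 3 of ~100.
demo = CompressionResult(results=[], original_count=5, compressed_count=3,
                         original_tokens=2500, compressed_tokens=300)
print(demo.savings_percent)   # 88.0

# The zero check avoids a division by zero when nothing was searched.
empty = CompressionResult([], 0, 0, 0, 0)
print(empty.savings_percent)  # 0.0
```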

Integration: Where Compression Happens

The compression doesn't happen in a separate pipeline step — it's embedded inside the researcher's tool. In the orchestrator, each researcher gets a capturing_tavily_search tool that compresses results before the agent even sees them:

def _create_source_capturing_tools(captured_sources, complexity=None):
    _complexity = complexity or QueryComplexity.COMPLEX

    @tool
    def capturing_tavily_search(query: str, max_results: int = 5) -> dict:
        """Search the web for information using Tavily API."""
        result = tavily_search(query=query, max_results=max_results)

        # Compress results based on complexity
        raw_results = result.get("results", [])
        if raw_results:
            compression = compress_tavily_results(raw_results, query, _complexity)
            result = {"results": compression.results}

        # Capture source metadata...
        return result

    # capturing_score_credibility is defined elsewhere (omitted from this excerpt)
    return capturing_tavily_search, capturing_score_credibility

This is a critical design decision. The compression is transparent to the researcher agent — it calls capturing_tavily_search and gets back results that look normal, just shorter. The agent doesn't know it's getting compressed results, so it doesn't try to compensate or request more. The compression is a property of the tool, not the prompt.
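The wrapping pattern can be sketched without the agent framework. In the example below, `make_compressing_search`, `fake_search`, and `fake_compress` are hypothetical stand-ins for the tool factory, the Tavily call, and the compressor. The point is only that the caller sees the same `{"results": [...]}` shape either way.

```python
from typing import Callable


def make_compressing_search(
    search_fn: Callable[..., dict],
    compress_fn: Callable[[list[dict], str], list[dict]],
) -> Callable[..., dict]:
    """Wrap a search function so callers only ever see compressed results."""

    def search(query: str, max_results: int = 5) -> dict:
        result = search_fn(query=query, max_results=max_results)
        raw_results = result.get("results", [])
        if raw_results:
            result = {"results": compress_fn(raw_results, query)}
        return result

    return search


# Stand-in for the real search call: returns bulky fake results.
def fake_search(query: str, max_results: int = 5) -> dict:
    return {"results": [{"url": f"https://example.com/{i}", "content": "x" * 2000}
                        for i in range(max_results)]}


# Stand-in for the real compressor: keep 3 results, 300 characters each.
def fake_compress(results: list[dict], query: str) -> list[dict]:
    return [{**r, "content": r["content"][:300]} for r in results[:3]]


tool_fn = make_compressing_search(fake_search, fake_compress)
out = tool_fn("What is Kubernetes?")
print(len(out["results"]), len(out["results"][0]["content"]))  # 3 300
```

Because the compression lives in the closure, swapping the compressor or the complexity level requires no change to the agent's prompt.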

Token Savings: Before and After

Let's trace a SIMPLE query through both pipelines:

Query: "What is Kubernetes?"

Naive pipeline:

4 sub-queries decomposed
4 researchers x 5 results x ~500 tokens content = 10,000 search tokens
4 researchers x agent overhead = 2,400 tokens
Total research stage: ~12,400 tokens

Engineered pipeline:

Complexity: SIMPLE (score -1: "very short query") = 1 sub-query
1 researcher x 3 results x ~100 tokens compressed content = 300 search tokens
1 researcher x agent overhead = 600 tokens
Total research stage: ~900 tokens

That's a 93% reduction in the research stage alone. The overall pipeline savings are smaller (the synthesis prompt is similar-sized regardless), but for search-heavy queries, the compression is dramatic.
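The arithmetic behind that figure can be reproduced in a few lines. The function name `research_stage_tokens` is made up for this sketch, and the per-result and per-agent token counts are the approximations from the trace above.

```python
def research_stage_tokens(
    researchers: int,
    results_per_search: int,
    tokens_per_result: int,
    overhead_per_researcher: int = 600,
) -> int:
    # Search tokens plus fixed per-agent overhead, as in the trace above.
    search = researchers * results_per_search * tokens_per_result
    overhead = researchers * overhead_per_researcher
    return search + overhead


naive = research_stage_tokens(4, 5, 500)       # 10,000 + 2,400
engineered = research_stage_tokens(1, 3, 100)  # 300 + 600
reduction = round((1 - engineered / naive) * 100, 1)

print(naive, engineered, reduction)  # 12400 900 92.7
```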

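For a COMPLEX query, the savings are more modest because both pipelines run 4 sub-queries, but compression still saves 40-50% of search tokens within each researcher. One caveat: a query like "Compare the top 5 cloud providers for ML workloads" scores only 1 under the rules shown earlier (-1 for being under 10 words, +2 for "compare"), which is MODERATE. It needs another signal, such as an analysis or current-events keyword, to land in the COMPLEX tier.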

What's Next

In Blog 3: Semantic Memory & Research Caching, we'll build the memory system that avoids redundant research entirely. If someone asks about "AI agents" and then "multi-agent systems," the second query gets cached findings injected into the synthesis prompt — no web search needed. We'll implement Jaccard similarity on keyword sets, LRU eviction, and JSON file persistence.

All code is from the Context Engine project — a fork of the Deep Research Agent.