AI Engineering · 12 min

Context Engine Series — Blog 5: Source & Findings Deduplication

4 parallel researchers return 20 sources — 8 are duplicates. Remove them before the model wastes tokens on redundant information, using URL-based source dedup and paragraph-level content hashing.

Four parallel researchers return 20 sources. Eight of them are duplicates -- the same Reuters article, the same Wikipedia page, the same Stack Overflow answer discovered independently by different agents. If you dump all 20 into the synthesis prompt, the model reads the same information multiple times, wastes tokens on redundant context, and sometimes even produces repetitive output that cites the same source under different numbers.

The fix is straightforward: deduplicate before synthesis. But there are two layers of redundancy to handle -- duplicate sources (same URL) and duplicate content (same paragraph appearing in multiple researchers' findings). In this post, we build both deduplication layers and wire them into the orchestrator.


The Context Engine Series

| Part | Title | Focus |
| --- | --- | --- |
| 1 | Architecture & Vision | System design, 6 techniques, pipeline overview |
| 2 | Query Complexity & Result Compression | Rule-based classifier, key sentence extraction |
| 3 | Semantic Memory & Research Caching | Jaccard similarity, local JSON cache |
| 4 | Dynamic Tool Selection | Intent classification, tool loadout |
| 5 | Source & Findings Deduplication (this post) | URL dedup, paragraph-level content hashing |
| 6 | Context X-ray Visualization | Real-time token tracking, WebSocket events |

Why Parallel Research Creates Duplicates

The Deep Research Agent decomposes a query like "Compare LangChain, CrewAI, and Strands for production agents" into 4 sub-queries. Each researcher agent independently searches the web via Tavily and scores sources for credibility. The problem is that popular, authoritative sources appear for multiple sub-queries.

Researcher 1 (LangChain features) finds docs.langchain.com, github.com/langchain-ai, and a TechCrunch article comparing frameworks. Researcher 3 (Strands SDK) also finds that same TechCrunch article. Researcher 4 (framework comparison) finds it too, plus duplicates of the official docs.

The result: 20 total sources, but only 12 unique URLs. The 8 duplicates waste tokens in two ways -- once in the source reference list, and again when the researchers' findings contain overlapping paragraphs that quote or summarize the same material.


Layer 1: Source Deduplication by URL

The first layer is simple: group sources by URL, keep the best version, sort by credibility.

def deduplicate_sources(all_sources: list[dict]) -> tuple[list[dict], int]:
    """Deduplicate sources by URL, keeping the one with highest credibility score.

    Args:
        all_sources: Sources collected from all parallel researchers.

    Returns:
        Tuple of (unique_sources sorted by credibility desc, num_pruned).
    """
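    # Best entry seen so far for each URL.
    seen_urls: dict[str, dict] = {}
    for src in all_sources:
        url = src.get("url", "")
        if not url:
            # Without a URL there is nothing to key on, so the entry is dropped.
            continue
        # Keep the first entry for a URL, or replace it when a later one scores higher.
        if url not in seen_urls or src.get("credibility_score", 0) > seen_urls[url].get("credibility_score", 0):
            seen_urls[url] = src

    unique = list(seen_urls.values())
    unique.sort(key=lambda s: s.get("credibility_score", 0), reverse=True)
    # Counts duplicate URLs and the URL-less entries skipped above.
    pruned = len(all_sources) - len(unique)

    if pruned > 0:
        # `logger` is assumed to be a structured logger that accepts keyword fields; its setup is not shown.
        logger.info("sources_deduplicated", total=len(all_sources), unique=len(unique), pruned=pruned)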

    return unique, pruned

The key design choice here is credibility-aware deduplication. When two researchers find the same URL, they may have scored its credibility differently (one might have called score_credibility with a better content snippet). We keep the version with the higher credibility score because it represents the most informed assessment of that source's quality.
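
To see the keep-the-highest-score rule in action, here is a minimal, self-contained sketch. It condenses the same logic as deduplicate_sources, leaves out logging, and runs on made-up sample data: the function name dedupe_by_url, the example.com URLs, the scores, and the researcher field are all hypothetical.

```python
def dedupe_by_url(sources: list[dict]) -> tuple[list[dict], int]:
    """Condensed sketch: group by URL, keep the highest credibility score."""
    best: dict[str, dict] = {}
    for src in sources:
        url = src.get("url", "")
        if not url:
            continue  # no URL means no key to group on
        current = best.get(url)
        if current is None or src.get("credibility_score", 0) > current.get("credibility_score", 0):
            best[url] = src
    # Highest-credibility sources first.
    unique = sorted(best.values(), key=lambda s: s.get("credibility_score", 0), reverse=True)
    return unique, len(sources) - len(unique)


# Hypothetical sample data: researchers 1 and 3 found the same article
# but scored it differently.
sample = [
    {"url": "https://example.com/frameworks", "credibility_score": 0.6, "researcher": 1},
    {"url": "https://example.com/docs", "credibility_score": 0.9, "researcher": 1},
    {"url": "https://example.com/frameworks", "credibility_score": 0.8, "researcher": 3},
]

unique, pruned = dedupe_by_url(sample)
for src in unique:
    print(src["url"], src["credibility_score"], "from researcher", src["researcher"])
print("pruned:", pruned)
```

The article survives once, with researcher 3's higher score, and lands after the docs page in the sorted list. Note that URLs are compared as exact strings, so https://example.com/docs and https://example.com/docs/ would count as two different sources.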

The final sort by credibility descending matters too. When the synthesis prompt includes "Available sources for citations," the most authoritative sources appear first. The model naturally gravitates toward earlier items in a list, so high-credibility sources get cited more often.


Layer 2: Findings Deduplication with Content Hashing

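Source dedup handles identical URLs, but the same paragraph can still show up in more than one researcher's findings, for example when two researchers quote the same passage or the LLM emits the same summary for the same search result. Researcher 1 writes "LangChain provides a modular framework for building LLM applications with chains and agents" and Researcher 4 writes the same sentence with different capitalization or line wrapping. Paragraphs that are reworded rather than repeated hash differently and pass through untouched.

We catch the repeats with paragraph-level MD5 hashing on normalized text: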

import hashlib
import re

from context.budget import count_tokens  # project-local token counter


def deduplicate_findings(findings_list: list[str]) -> tuple[str, int]:
    """Remove duplicate paragraphs across multiple researcher findings.

    Uses paragraph-level content hashing to detect and remove duplicates
    while preserving the order of first occurrence.

    Args:
        findings_list: List of findings text from each researcher.

    Returns:
        Tuple of (deduplicated combined text, tokens_saved).
    """
    original_text = "\n\n".join(findings_list)
    original_tokens = count_tokens(original_text)

    seen_hashes: set[str] = set()
    unique_paragraphs: list[str] = []

    for findings in findings_list:
        paragraphs = re.split(r"\n{2,}", findings.strip())
        for para in paragraphs:
            para = para.strip()
            if not para or len(para) < 20:
                continue

            # Normalize for hashing: lowercase, collapse whitespace
            normalized = re.sub(r"\s+", " ", para.lower().strip())
            para_hash = hashlib.md5(normalized.encode()).hexdigest()

            if para_hash not in seen_hashes:
                seen_hashes.add(para_hash)
                unique_paragraphs.append(para)

    deduped_text = "\n\n".join(unique_paragraphs)
    deduped_tokens = count_tokens(deduped_text)
    tokens_saved = max(0, original_tokens - deduped_tokens)

    return deduped_text, tokens_saved

There are several deliberate choices in this implementation worth understanding:

Paragraph-level granularity. We split on double newlines (\n{2,}) rather than sentences. Sentence-level dedup would be too aggressive -- it would remove common phrases like "According to the documentation" that appear in legitimately different contexts. Paragraph-level is the right granularity because a full duplicate paragraph almost always represents truly redundant information.

Normalization before hashing. Two paragraphs might differ only in whitespace or capitalization. re.sub(r"\s+", " ", para.lower().strip()) collapses all whitespace variations and lowercases the text so that "LangChain provides..." and "langchain provides..." hash to the same value.

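Minimum length filter. Paragraphs under 20 characters (like "Sources:" or "Summary:") are skipped. These are structural markers that repeat across researchers' outputs and carry no findings of their own. Note that the continue statement drops them from the combined text entirely; they are not merely exempted from hashing.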

First-occurrence ordering. We iterate through findings in order and keep the first occurrence of each paragraph. This preserves the narrative flow of the first researcher's output and drops redundant paragraphs from later researchers.

MD5 is fine here. We are not defending against adversarial collisions -- we just need fast content fingerprinting. MD5 runs in microseconds and has negligible collision risk for paragraphs of natural language text.
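
The choices above can be checked on a small example. This is a minimal sketch that mirrors the normalization, hashing, length filter, and first-occurrence rule, but leaves out token counting. The helper names paragraph_fingerprint and unique_paragraphs and the sample researcher outputs are hypothetical.

```python
import hashlib
import re


def paragraph_fingerprint(para: str) -> str:
    """Lowercase, collapse whitespace, then MD5 the normalized text."""
    normalized = re.sub(r"\s+", " ", para.lower().strip())
    return hashlib.md5(normalized.encode()).hexdigest()


def unique_paragraphs(findings_list: list[str], min_len: int = 20) -> list[str]:
    """Keep the first occurrence of each paragraph; drop very short ones."""
    seen: set[str] = set()
    kept: list[str] = []
    for findings in findings_list:
        for para in re.split(r"\n{2,}", findings.strip()):
            para = para.strip()
            if len(para) < min_len:
                continue  # structural markers such as "Summary:" are dropped
            fp = paragraph_fingerprint(para)
            if fp not in seen:
                seen.add(fp)
                kept.append(para)
    return kept


# Hypothetical researcher outputs: r4 repeats r1's paragraph with different
# casing, extra spaces, and a line wrap, then adds one new paragraph.
r1 = "Summary:\n\nLangChain provides a modular framework for building LLM applications."
r4 = (
    "langchain provides a   modular framework\nfor building LLM applications."
    "\n\nCrewAI focuses on role-based agent teams."
)

kept = unique_paragraphs([r1, r4])
for para in kept:
    print(repr(para))
```

Two paragraphs survive: the LangChain paragraph in researcher 1's original casing, and the CrewAI paragraph. The "Summary:" marker is gone because it falls under the length threshold, and the single newline inside r4's first paragraph does not split it because only blank lines separate paragraphs.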


Integration in the Orchestrator

Both deduplication functions are called after parallel research completes, before the findings enter the critique and synthesis stages:

# Step 4: Extract & consolidate findings
all_sources: list[dict] = []
all_findings: list[str] = []
for i, result in enumerate(successful):
    all_findings.append(
        f"### Sub-query {i + 1}: {result.sub_query}\n\n{result.findings}"
    )
    all_sources.extend(result.sources)

# Deduplicate findings at paragraph level
combined_findings, findings_tokens_saved = deduplicate_findings(all_findings)

# Deduplicate sources by URL
unique_sources, sources_pruned = deduplicate_sources(all_sources)

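The two calls are independent: findings dedup operates on the text and source dedup operates on metadata, and neither reads the other's output, so they can run in either order. What matters is that both finish before critique and synthesis. Both run in milliseconds -- no LLM calls, no API calls.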


X-ray Visibility

The deduplication stage emits a context_assembly event so the Context X-ray panel can show exactly what happened:

event = context_engine.record_stage(
    name="deduplication",
    tokens=dedup_tokens,
    naive_tokens=naive_dedup_tokens,
    items=[
        {"label": "Sources: total", "value": str(len(all_sources))},
        {"label": "Sources: unique", "value": str(len(unique_sources))},
        {"label": "Sources pruned", "value": str(sources_pruned)},
        {"label": "Findings tokens saved", "value": f"{findings_tokens_saved:,}"},
    ],
)

In the X-ray panel, you see a card like:

Deduplication                         -28%    4.2k t
  Sources: total                                  20
  Sources: unique                                 12
  Sources pruned                                   8
  Findings tokens saved                        1,847

The -28% badge shows the savings at this stage compared to the naive approach (no dedup).
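
The post does not show how the badge is computed. A plausible reading, stated here as an assumption, is the percent change of the stage's token count relative to the naive baseline. The function name stage_delta_pct and the sample numbers below are hypothetical.

```python
def stage_delta_pct(tokens: int, naive_tokens: int) -> float:
    """Percent change of a stage's tokens relative to the naive baseline.

    Assumed formula, not taken from the project: negative values mean
    the stage saved tokens.
    """
    if naive_tokens <= 0:
        return 0.0  # no baseline to compare against
    return (tokens - naive_tokens) / naive_tokens * 100


# Hypothetical numbers chosen to resemble the card above:
# 4,200 tokens after dedup versus 5,833 without it.
print(f"{stage_delta_pct(4200, 5833):.0f}%")  # -28%
```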


Real Numbers

Across production runs, deduplication typically delivers:

| Metric | Before Dedup | After Dedup | Savings |
| --- | --- | --- | --- |
| Sources | 18-24 | 10-14 | 30-45% fewer |
| Findings tokens | 6,000-8,000 | 4,200-5,800 | 25-35% fewer |
| Synthesis input | ~12,000 | ~8,500 | ~30% smaller |

The synthesis prompt is the single most expensive part of the pipeline -- it receives all findings, all source references, the critique assessment, and memory context. Every token we remove from that prompt directly reduces the cost of the final (and most expensive) LLM call.

For a complex comparative query with 4 sub-queries, findings dedup alone saves 1,500-2,000 tokens. Source dedup saves another 300-500 tokens from the source reference list. Combined, that is roughly 1,800-2,500 tokens saved from the synthesis input -- a meaningful reduction that compounds across requests.


Why Not Semantic Deduplication?

You might wonder: why use exact MD5 hashing instead of semantic similarity? If Researcher 1 writes "React uses a virtual DOM" and Researcher 3 writes "React employs a virtual DOM for efficient rendering," those paragraphs are semantically similar but hash differently.

We chose exact hashing for three reasons:

  1. Speed. MD5 hashing runs in microseconds. Semantic similarity requires embedding computation, which adds latency and -- ironically -- costs tokens if you use an LLM for it.

  2. Precision. Semantic dedup risks removing paragraphs that are similar but contain genuinely different details. "React uses a virtual DOM" and "React uses a virtual DOM that batches updates in a fiber architecture" are similar but the second adds important information. Exact dedup is conservative -- it only removes truly identical content.

  3. The real duplicates are exact. In practice, most duplication in our pipeline comes from researchers independently quoting the same source verbatim or from the LLM producing the same summary of the same search result with only cosmetic differences. Normalization (lowercase + whitespace collapse) catches those cases without needing embeddings; summaries that differ in wording are left alone.

The semantic memory module (Blog 3) already handles semantic similarity at the query level. Within a single research run, exact paragraph hashing is the right tool.
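
The precision argument can be made concrete with the example sentences from this section. The sketch below uses word-level Jaccard similarity as a cheap stand-in for semantic similarity; it is not an embedding model, and the helper names are hypothetical.

```python
import hashlib
import re


def fingerprint(text: str) -> str:
    """Exact-match fingerprint: lowercase, collapse whitespace, MD5."""
    normalized = re.sub(r"\s+", " ", text.lower().strip())
    return hashlib.md5(normalized.encode()).hexdigest()


def jaccard(a: str, b: str) -> float:
    """Word-set Jaccard similarity: shared words divided by all distinct words."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


base = "React uses a virtual DOM"
reworded = "React employs a virtual DOM for efficient rendering"
extended = "React uses a virtual DOM that batches updates in a fiber architecture"

# Exact hashing treats all three as distinct.
print(fingerprint(base) == fingerprint(reworded))  # False
print(fingerprint(base) == fingerprint(extended))  # False

# Word overlap scores the rewording and the informative extension almost the same.
print(round(jaccard(base, reworded), 2))  # 4 shared / 9 distinct words = 0.44
print(round(jaccard(base, extended), 2))  # 5 shared / 11 distinct words = 0.45
```

With this measure, any threshold low enough to merge the reworded sentence would also merge the extended one and discard the detail about batching and fiber. An embedding-based similarity could behave differently, but it would need the same kind of threshold tuning.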


What This Module Replaces

Before extracting deduplication into its own module, the orchestrator had inline dedup logic scattered across the consolidation step -- a seen_urls set here, a manual content check there. The deduplicator.py module centralizes this into two clean functions with consistent logging and token tracking.

The module depends only on context.budget.count_tokens for token measurement and Python standard library modules (hashlib, re). No external dependencies, no API calls, no latency.


What's Next

We have now covered every stage of the Context Engine pipeline: complexity classification, result compression, semantic memory, dynamic tool selection, and deduplication. But how do you know any of this is working? How do you see the token savings in real time?

In Blog 6: Context X-ray Visualization, we build the real-time visualization panel that shows every context engineering stage, every token count, and every saving -- with collapsible stage cards, a dual-bar token budget meter, and live WebSocket streaming. You cannot optimize what you cannot see.


All code is open source: github.com/MinhQuanBuiSco/context-engine