AI Engineering··14 min

Context Engine Series — Blog 3: Semantic Memory & Research Caching

Your agent just researched AI agents. Five minutes later, someone asks about multi-agent systems. Build a semantic memory that caches past research using Jaccard similarity — no embeddings, no vector DB, just smart keyword matching.

Your agent just researched AI agents. Five minutes later, someone asks about multi-agent systems. Why research it again?

The Deep Research Agent has no memory. Every query starts from scratch — full decomposition, fresh web searches, new LLM calls. This is the most wasteful pattern in production AI systems: redundant work on overlapping topics. A research agent that handles 100 queries a day is probably re-researching the same ground 20-30 times.

In this post, we build a semantic memory that caches past research findings and retrieves them when a similar query arrives. No embeddings. No vector database. No external service. Just Jaccard similarity on keyword sets, backed by a local JSON file. It's ~170 lines of Python that can eliminate entire research cycles.


The Context Engine Series

Part | Title | Focus
1 | Architecture & Vision | System design, 6 techniques, pipeline overview
2 | Query Complexity & Result Compression | Rule-based classifier, key sentence extraction
3 | Semantic Memory & Research Caching (this post) | Jaccard similarity, local JSON cache
4 | Dynamic Tool Selection | Intent classification, tool loadout
5 | Source & Findings Deduplication | URL dedup, paragraph-level content hashing
6 | Context X-ray Visualization | Real-time token tracking, WebSocket events

The Problem: Stateless Agents Repeat Work

Consider this sequence of queries from a real user session:

  1. "What are the latest advances in AI agents?"
  2. "Compare LangChain, CrewAI, and Strands SDK for building agents"
  3. "How do multi-agent systems coordinate task execution?"

These three queries share significant topical overlap. The research findings from query 1 contain information directly useful for queries 2 and 3. Without memory, the agent performs 3 full research cycles: 12 sub-queries, 60 web searches, and 12 LLM calls for researchers alone. With memory, a query that matches a cached entry gets the cached findings injected into its synthesis prompt, supplementing its fresh web searches and potentially reducing how many are needed.

The design goals:

  • No embeddings — embedding models add latency, cost, and infrastructure complexity. For research query matching, simpler methods work.
  • No external database — no Redis, no Pinecone, no Postgres. A local JSON file persisted via Docker volume.
  • Fast lookup — sub-millisecond search across up to 100 cached entries.
  • Automatic eviction — oldest entries are pruned when the cache exceeds its limit.
  • Integration transparency — the rest of the pipeline doesn't know whether findings came from cache or fresh research.

The Data Model

Each cached entry stores the original query, extracted keywords for matching, the research findings, and source metadata:

from dataclasses import dataclass


@dataclass
class MemoryEntry:
    """A cached research result."""

    query: str
    query_hash: str
    keywords: list[str]
    findings: str
    sources: list[dict]
    created_at: str
    token_count: int

The query_hash is a truncated SHA-256 of the normalized query — used as the dictionary key for O(1) exact-match deduplication. The keywords list enables fuzzy matching via Jaccard similarity. The findings field is capped at 10,000 characters, and sources at 10 entries — enough context to be useful without bloating the cache file.
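To make the shape of an entry concrete, here is a minimal sketch that builds one entry and round-trips it through JSON, the same way the persistence layer does later in this post. The findings text, source values, timestamp, and token count are placeholders made up for illustration, not real pipeline output.

```python
import json
from dataclasses import asdict, dataclass
from hashlib import sha256


@dataclass
class MemoryEntry:
    """A cached research result (same fields as above)."""

    query: str
    query_hash: str
    keywords: list[str]
    findings: str
    sources: list[dict]
    created_at: str
    token_count: int


query = "What are the latest advances in AI agents?"
entry = MemoryEntry(
    query=query,
    # Truncated SHA-256 of the normalized (lowercased, stripped) query
    query_hash=sha256(query.lower().strip().encode()).hexdigest()[:16],
    keywords=["advances", "agents", "latest"],
    findings="Illustrative findings text.",  # placeholder
    sources=[{"url": "https://example.com", "title": "Example",
              "credibility_score": 50}],      # placeholder
    created_at="2025-01-01T00:00:00+00:00",   # placeholder timestamp
    token_count=4,                            # placeholder value
)

# asdict() flattens the dataclass into JSON-serializable primitives,
# and MemoryEntry(**data) rebuilds it on load.
payload = json.dumps({entry.query_hash: asdict(entry)})
restored = MemoryEntry(**json.loads(payload)[entry.query_hash])
```

Because every field is a string, int, list, or dict, no custom JSON encoder is needed.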


The SemanticMemory Class

The full implementation lives in backend/context/memory.py:

class SemanticMemory:
    """Local file-backed memory for past research.

    Matching uses Jaccard similarity on normalized keyword sets.
    No embeddings, no external DB — simple and fast.
    """

    def __init__(self, storage_path: str = "./data/memory.json", max_entries: int = 100):
        self.storage_path = Path(storage_path)
        self.max_entries = max_entries
        self._cache: dict[str, MemoryEntry] = {}
        self._load()

The class initializes by loading existing entries from the JSON file. If the file doesn't exist or can't be parsed, it logs a warning and starts with an empty cache instead of crashing. Note that a corrupted file is discarded as a whole: the next save overwrites it with the new cache.

Keyword Extraction

The foundation of similarity matching is keyword extraction. We strip stopwords and short words, then normalize to lowercase:

_STOPWORDS = frozenset({
    "the", "a", "an", "is", "are", "was", "were", "be", "been", "being",
    "have", "has", "had", "do", "does", "did", "will", "would", "could",
    "should", "may", "might", "can", "shall", "must", "need",
    "in", "on", "at", "to", "for", "of", "with", "by", "from", "about",
    "and", "or", "but", "not", "no", "nor", "so", "yet",
    "what", "how", "why", "when", "where", "which", "who", "whom",
    "this", "that", "these", "those", "it", "its",
    "i", "me", "my", "we", "us", "our", "you", "your", "he", "she", "they",
})

def _extract_keywords(self, text: str) -> set[str]:
    """Extract normalized keywords from text, removing stopwords."""
    words = re.findall(r"\w+", text.lower())
    return {w for w in words if w not in _STOPWORDS and len(w) > 2}

For the query "What are the latest advances in AI agents?", this produces: {"latest", "advances", "agents"}. Short words (1-2 chars) and common stopwords are removed, leaving only the content-bearing terms.

The len(w) > 2 filter is important — it removes words like "AI" (2 chars), "ML", and "vs" that are too short to be meaningful discriminators. This is a trade-off: we lose some acronyms but gain much cleaner keyword sets. In practice, research queries tend to use the full terms ("artificial intelligence" rather than "AI") often enough that the loss is minimal.
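The effect of the length filter is easy to check in isolation. The sketch below lifts the extraction logic into a standalone function (the module-level names STOPWORDS and extract_keywords are chosen here for illustration) and runs it on three inputs, including a query made only of short acronyms.

```python
import re

STOPWORDS = frozenset({
    "the", "a", "an", "is", "are", "was", "were", "be", "been", "being",
    "have", "has", "had", "do", "does", "did", "will", "would", "could",
    "should", "may", "might", "can", "shall", "must", "need",
    "in", "on", "at", "to", "for", "of", "with", "by", "from", "about",
    "and", "or", "but", "not", "no", "nor", "so", "yet",
    "what", "how", "why", "when", "where", "which", "who", "whom",
    "this", "that", "these", "those", "it", "its",
    "i", "me", "my", "we", "us", "our", "you", "your", "he", "she", "they",
})


def extract_keywords(text: str) -> set[str]:
    """Lowercase, split on word characters, drop stopwords and short words."""
    words = re.findall(r"\w+", text.lower())
    return {w for w in words if w not in STOPWORDS and len(w) > 2}


full_query = extract_keywords("What are the latest advances in AI agents?")
acronyms_only = extract_keywords("AI vs ML")      # every word is 2 chars
hyphenated = extract_keywords("multi-agent systems")  # \w+ splits on "-"

print(full_query)     # three keywords: latest, advances, agents
print(acronyms_only)  # empty set
print(hyphenated)     # three keywords: multi, agent, systems
```

A query like "AI vs ML" ends up with no keywords at all, so a search for it returns no matches regardless of what is cached. That is the cost side of the trade-off described above.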


Jaccard Similarity: The Matching Engine

Jaccard similarity is the ratio of set intersection to set union. It's the simplest meaningful measure of set overlap:

def _jaccard_similarity(self, set_a: set[str], set_b: set[str]) -> float:
    """Compute Jaccard similarity between two sets."""
    if not set_a or not set_b:
        return 0.0
    intersection = len(set_a & set_b)
    union = len(set_a | set_b)
    return intersection / union if union > 0 else 0.0

Here's how it works for our example queries:

Query 1: "What are the latest advances in AI agents?"
Keywords: {"latest", "advances", "agents"}

Query 3: "How do multi-agent systems coordinate task execution?"
Keywords: {"multi", "agent", "systems", "coordinate", "task", "execution"}

Wait: "agents" and "agent" don't match, because we compare exact strings. The two keyword sets share nothing, so the Jaccard score is 0 and Query 3 misses the cache entirely. This is a known limitation. Stemming would make the two words equal (both become "agent"), but it adds a dependency and complexity, and a single shared keyword would still not clear the 0.5 threshold. In practice, the cache pays off when queries reuse most of the same terminology, such as rephrasings or narrow variations of an earlier query.
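For a sense of what stemming would and would not change, here is a rough sketch using a deliberately crude plural-stripping rule. The naive_stem function is a hypothetical stand-in, not part of the project and not a real stemmer; a proper one (Porter, Snowball) handles far more cases.

```python
def naive_stem(word: str) -> str:
    """Crude illustration only: strip one trailing 's' from longer words."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def jaccard(set_a: set[str], set_b: set[str]) -> float:
    """Ratio of shared items to all distinct items."""
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


# Keyword sets for Query 1 and Query 3, as listed above
query_1 = {"latest", "advances", "agents"}
query_3 = {"multi", "agent", "systems", "coordinate", "task", "execution"}

exact_score = jaccard(query_1, query_3)

stemmed_1 = {naive_stem(w) for w in query_1}
stemmed_3 = {naive_stem(w) for w in query_3}
stemmed_score = jaccard(stemmed_1, stemmed_3)

print(exact_score)    # no shared keywords
print(stemmed_score)  # one shared keyword ("agent") out of eight
```

With this rule the two sets share "agent", but one shared keyword out of eight gives 0.125, still well below 0.5. Stemming alone would not turn this pair into a cache hit.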

A better example:

Query A: "Compare Python web frameworks for building APIs"
Keywords: {"compare", "python", "web", "frameworks", "building", "apis"}

Query B: "Best Python frameworks for REST API development"
Keywords: {"best", "python", "frameworks", "rest", "api", "development"}

Intersection: {"python", "frameworks"} → 2
Union: {"compare", "python", "web", "frameworks", "building", "apis",
        "best", "rest", "api", "development"} → 10

Jaccard = 2/10 = 0.2  →  Below 0.5 threshold, no match

This is actually correct behavior — while these queries are related, the research findings from a general "compare Python web frameworks" query may not be useful for a specific "REST API development" query. The threshold is conservative by design: false negatives (missing a cache hit) cost a web search, but false positives (injecting irrelevant cached context) can degrade the final report quality.
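To see how demanding the 0.5 threshold is, consider the simplified case (an assumption made here for illustration) of two keyword sets A and B with the same size n that share k keywords:

```latex
J(A, B) = \frac{|A \cap B|}{|A \cup B|} = \frac{k}{2n - k},
\qquad
J(A, B) \ge \frac{1}{2}
\iff 2k \ge 2n - k
\iff k \ge \frac{2n}{3}
```

With n = 6, as in Query A and Query B, at least 4 of the 6 keywords must be shared to produce a hit. The two queries share only 2, which gives 2 / (12 - 2) = 0.2.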


Search: Finding Cached Entries

The search method scans all cached entries and returns those above the similarity threshold, sorted by relevance:

def search(self, query: str, threshold: float = 0.5) -> list[tuple[MemoryEntry, float]]:
    """Find cached entries similar to the query."""
    query_keywords = self._extract_keywords(query)
    if not query_keywords:
        return []

    matches = []
    for entry in self._cache.values():
        entry_keywords = set(entry.keywords)
        similarity = self._jaccard_similarity(query_keywords, entry_keywords)
        if similarity >= threshold:
            matches.append((entry, similarity))

    matches.sort(key=lambda x: x[1], reverse=True)
    return matches

With a 100-entry cap, this linear scan is effectively instant. Even at 1,000 entries, iterating over keyword sets and computing set operations is sub-millisecond work. There's no need for approximate nearest neighbor search, HNSW indexes, or any of the infrastructure you'd need for embedding-based retrieval.
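If you want to measure the scan cost on your own hardware rather than take it on faith, a small timing harness is enough. The sketch below uses synthetic keyword lists (the "term0" to "term499" vocabulary and the cache contents are made up) and a standalone version of the search loop; actual timings depend on the machine and Python version.

```python
import random
import time


def jaccard(set_a: set[str], set_b: set[str]) -> float:
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def search(cache: list[list[str]], query_keywords: set[str],
           threshold: float = 0.5) -> list[tuple[list[str], float]]:
    """Linear scan, mirroring the search method above."""
    matches = []
    for keywords in cache:
        score = jaccard(query_keywords, set(keywords))
        if score >= threshold:
            matches.append((keywords, score))
    matches.sort(key=lambda m: m[1], reverse=True)
    return matches


random.seed(0)
vocab = [f"term{i}" for i in range(500)]              # synthetic vocabulary
cache = [random.sample(vocab, 6) for _ in range(999)]  # 999 unrelated entries
query_keywords = {"python", "web", "frameworks", "apis"}
cache.append(sorted(query_keywords))                  # one guaranteed match

start = time.perf_counter()
hits = search(cache, query_keywords)
elapsed_ms = (time.perf_counter() - start) * 1000

print(f"{len(cache)} entries scanned in {elapsed_ms:.3f} ms, "
      f"{len(hits)} hit(s)")
```

Only the deliberately inserted entry clears the threshold, since the synthetic vocabulary never overlaps with the query keywords.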


Store: Caching New Research

After a successful research cycle, findings are cached for future reuse:

def store(self, query: str, findings: str, sources: list[dict]) -> None:
    """Cache research findings for future reuse."""
    keywords = list(self._extract_keywords(query))
    query_hash = sha256(query.lower().strip().encode()).hexdigest()[:16]

    entry = MemoryEntry(
        query=query,
        query_hash=query_hash,
        keywords=keywords,
        findings=findings[:10000],   # Cap at 10k chars
        sources=sources[:10],        # Cap at 10 sources
        created_at=datetime.now(timezone.utc).isoformat(),
        token_count=count_tokens(findings),
    )

    self._cache[query_hash] = entry

    # Evict oldest entries if over limit
    if len(self._cache) > self.max_entries:
        sorted_entries = sorted(
            self._cache.items(), key=lambda x: x[1].created_at
        )
        for key, _ in sorted_entries[:len(self._cache) - self.max_entries]:
            del self._cache[key]

    self._save()

A few design decisions worth noting:

10k character cap on findings. Research reports can be 20-30k characters. Storing everything would balloon the cache file and — more importantly — inject too much cached context into future synthesis prompts. 10k chars is enough to provide useful background without overwhelming the LLM.

10 source cap. Each source has a URL, title, and credibility score. 10 sources give enough citation coverage without bloating each cache entry.

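Age-based (FIFO) eviction. When the cache exceeds max_entries (100 by default), the entries with the oldest created_at are evicted first. This is not true LRU: a cache hit does not refresh an entry's timestamp; only re-storing the same query does. It is simple and effective, since recent research is more likely to be relevant than research from weeks ago.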

SHA-256 hash as key. Using a hash of the normalized query as the dictionary key means exact duplicates overwrite cleanly (idempotent), and there's no risk of key collision in practice with a 16-character hex digest.
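The eviction step can be exercised on its own. This is a minimal sketch with a plain dict standing in for the cache and a hypothetical helper name, evict_oldest; it relies on the fact that ISO 8601 timestamps in the same UTC offset sort chronologically as plain strings.

```python
def evict_oldest(cache: dict[str, dict], max_entries: int) -> list[str]:
    """Remove the oldest entries (by created_at) until the cache fits.

    Returns the evicted keys, oldest first.
    """
    overflow = len(cache) - max_entries
    if overflow <= 0:
        return []
    # String sort is chronological because all timestamps are UTC ISO 8601
    oldest_first = sorted(cache.items(), key=lambda item: item[1]["created_at"])
    evicted = [key for key, _ in oldest_first[:overflow]]
    for key in evicted:
        del cache[key]
    return evicted


cache = {
    "a": {"created_at": "2025-01-01T00:00:00+00:00"},
    "b": {"created_at": "2025-01-02T00:00:00+00:00"},
    "c": {"created_at": "2025-01-03T00:00:00+00:00"},
}
evicted = evict_oldest(cache, max_entries=2)
print(evicted, sorted(cache))
```

Sorting the whole cache on every overflow is O(n log n), which is negligible at 100 entries.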


Integration: Pre-processing and Post-processing

The memory integrates at two points in the pipeline:

Before Research: Memory Lookup

In ContextEngine.pre_process(), memory is searched before any research begins:

async def pre_process(self, query: str) -> AsyncIterator[dict]:
    # Stage 1: Classify complexity
    self.complexity = classify_complexity(query)
    self.budget = create_budget(self.budget_total, self.complexity.level)
    # ... emit X-ray event ...

    # Stage 2: Memory lookup
    matches = self.memory.search(query, threshold=0.5)

    if matches:
        self.memory_hits = [
            {
                "query": m.query,
                "findings": m.findings,
                "sources": m.sources,
                "similarity": round(score, 3),
                "created_at": m.created_at,
            }
            for m, score in matches[:3]  # Top 3 matches
        ]

If matches are found, they're stored in self.memory_hits for later injection. The Context X-ray panel shows the cache hits with their similarity scores.

During Synthesis: Context Injection

Cached findings are injected into the synthesis prompt via get_memory_context():

def get_memory_context(self) -> str:
    """Get cached findings to inject into the synthesis prompt."""
    if not self.memory_hits:
        return ""

    parts = []
    for hit in self.memory_hits[:2]:  # Use top 2 matches
        parts.append(
            f"\n### Relevant past research (similarity: {hit['similarity']:.0%}):\n"
            f"Query: {hit['query']}\n"
            f"Findings summary: {hit['findings'][:2000]}\n"
        )
    return "\n".join(parts)

In the orchestrator, this context is injected right before the synthesis LLM call:

# Inject cached context from memory if available
memory_context = context_engine.get_memory_context()
memory_section = (f"\nRelevant past research:\n{memory_context}\n\n"
                  if memory_context else "")

synthesis_prompt = (
    f"Write a comprehensive research report on: {query}\n\n"
    f"Research findings:\n{combined_findings[:12000]}\n\n"
    f"{memory_section}"
    f"Critique assessment:\n{critique_text[:2000]}\n\n"
    f"Available sources for citations:\n{source_refs}\n\n"
    # ...
)

The memory context supplements fresh findings; it doesn't replace them. The synthesizer gets both fresh research and relevant cached context, allowing it to produce a more comprehensive report. If the cached findings overlap with fresh findings, the synthesizer LLM is relied on to avoid repeating the same points.

After Research: Store Findings

After synthesis completes, the orchestrator stores the new findings in memory:

# In orchestrator.run_research():
try:
    context_engine.store_research(
        query=query,
        findings=combined_findings[:10000],
        sources=[
            {"url": s["url"], "title": s.get("title", ""),
             "credibility_score": s.get("credibility_score", 50)}
            for s in unique_sources[:10]
        ],
    )
except Exception as e:
    logger.warning("memory_store_failed", error=str(e))

The try/except is intentional — memory storage is a best-effort optimization. If it fails (disk full, permission error), the research pipeline still succeeds and the user gets their report. No critical path depends on memory persistence.


Persistence: The JSON File

The cache persists to ./data/memory.json. The ./data directory is mounted as a Docker volume:

def _save(self) -> None:
    """Persist cache to JSON file."""
    try:
        self.storage_path.parent.mkdir(parents=True, exist_ok=True)
        data = {key: asdict(entry) for key, entry in self._cache.items()}
        self.storage_path.write_text(json.dumps(data, indent=2))
    except Exception as e:
        logger.warning("memory_save_failed", error=str(e))

def _load(self) -> None:
    """Load cache from JSON file."""
    if not self.storage_path.exists():
        self._cache = {}
        return

    try:
        data = json.loads(self.storage_path.read_text())
        self._cache = {
            key: MemoryEntry(**entry) for key, entry in data.items()
        }
    except Exception as e:
        logger.warning("memory_load_failed", error=str(e))
        self._cache = {}

In the Docker Compose configuration, the ./data directory is mounted as a volume so the memory cache survives container restarts. A 100-entry cache with 10k-character findings each produces a JSON file of roughly 1-2 MB — trivial for disk storage.
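One possible hardening step, not part of the code shown above: write_text rewrites the file in place, so a crash in the middle of a write could leave truncated JSON behind, which _load would then treat as corrupted and discard. A minimal sketch, assuming a hypothetical helper named save_atomically, writes to a temporary file in the same directory and swaps it in with os.replace.

```python
import json
import os
import tempfile
from pathlib import Path


def save_atomically(path: Path, data: dict) -> None:
    """Write JSON to a temp file, then rename it over the target."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # Same directory as the target, so the rename stays on one filesystem
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            json.dump(data, tmp, indent=2)
        os.replace(tmp_name, path)  # swap the finished file into place
    except BaseException:
        # Remove the temp file if anything failed before the swap
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


with tempfile.TemporaryDirectory() as tmp_dir:
    target = Path(tmp_dir) / "data" / "memory.json"
    save_atomically(target, {"abc123": {"query": "example"}})
    loaded = json.loads(target.read_text(encoding="utf-8"))

print(loaded)
```

Readers of the file then see either the old version or the new one, never a half-written one. This works here because the whole ./data directory is mounted, not the single file.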

Why Not a Vector Database?

The obvious question: wouldn't embeddings + vector search be more accurate? Yes, probably — cosine similarity on sentence embeddings would catch semantic relationships that keyword overlap misses ("machine learning" ~ "neural networks"). But the trade-offs don't justify it for this use case:

  • Latency: Embedding a query takes 10-50ms (API call or local model inference). Jaccard similarity on keyword sets takes <0.1ms.
  • Infrastructure: A vector database (Pinecone, Qdrant, pgvector) needs provisioning, connection management, and monitoring. A JSON file needs a directory.
  • Accuracy: For research query matching, keyword overlap works surprisingly well. Research queries tend to use explicit domain terminology: "Kubernetes deployment strategies" and "Kubernetes scaling best practices" share the keyword "kubernetes," which is the strongest topical signal. (One shared keyword alone scores only 1/6 here, below the 0.5 threshold, so a hit still requires most of the terms to overlap.)
  • Debugging: You can open memory.json in a text editor and see exactly what's cached. Try that with a vector index.

If the cache grew to 10,000+ entries or queries became more conversational ("tell me more about that thing we discussed"), embeddings would be the right call. At 100 entries with explicit research queries, Jaccard is the right tool.


The Memory X-ray Event

When memory lookup runs during pre-processing, the Context X-ray panel shows the result:

# Cache hit event
{
    "type": "context_assembly",
    "stage": "memory_lookup",
    "tokens": 1200,         # Tokens from cached findings
    "naive_tokens": 0,      # Naive pipeline has no memory
    "items": [
        {"label": "Cache hits", "value": "2"},
        {"label": "Match: Compare Python web frameworks...",
         "value": "73% similar"},
        {"label": "Match: Best Python tools for API...",
         "value": "52% similar"},
        {"label": "Cached tokens available", "value": "1,200"},
    ]
}

# Cache miss event
{
    "type": "context_assembly",
    "stage": "memory_lookup",
    "tokens": 0,
    "naive_tokens": 0,
    "items": [
        {"label": "Cache hits", "value": "0 (no similar past queries)"}
    ]
}

For memory, naive_tokens is always 0 because the naive pipeline has no memory system at all. Memory hits therefore show up as "free" context in the X-ray comparison, highlighting the advantage of caching past research.
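As a rough illustration of how events with this shape could be assembled from the memory hits, here is a sketch. The helper name build_memory_event, the 35-character label truncation, and the cached_tokens argument are all assumptions made for this example; the project's actual event code is not shown in this post.

```python
def build_memory_event(hits: list[dict], cached_tokens: int) -> dict:
    """Build a memory_lookup X-ray event (hypothetical helper)."""
    if not hits:
        items = [{"label": "Cache hits",
                  "value": "0 (no similar past queries)"}]
        tokens = 0
    else:
        items = [{"label": "Cache hits", "value": str(len(hits))}]
        for hit in hits:
            short_query = hit["query"][:35].rstrip()  # assumed label length
            items.append({
                "label": f"Match: {short_query}...",
                "value": f"{hit['similarity']:.0%} similar",
            })
        items.append({"label": "Cached tokens available",
                      "value": f"{cached_tokens:,}"})
        tokens = cached_tokens
    return {
        "type": "context_assembly",
        "stage": "memory_lookup",
        "tokens": tokens,
        "naive_tokens": 0,  # the naive pipeline has no memory
        "items": items,
    }


hit_event = build_memory_event(
    [{"query": "Compare Python web frameworks for building APIs",
      "similarity": 0.73}],
    cached_tokens=1200,
)
miss_event = build_memory_event([], cached_tokens=0)
```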


What's Next

In Blog 4: Dynamic Tool Selection & Deduplication, we'll build the tool selector that classifies query intent (FACTUAL, COMPARATIVE, CONCEPTUAL, CURRENT_EVENTS) and only includes relevant tool descriptions in each agent's context — saving 100-220 tokens per agent. We'll also build the paragraph-level deduplicator that catches overlapping findings across parallel researchers using content hashing.


All code is from the Context Engine project — a fork of the Deep Research Agent.