AI Engineering · 14 min

Document Analysis MCP Series — Blog 1: Architecture & The MCP Pattern

The contract analyzer rejected RAG. This project embraces it, because an 88-page SEC filing eats most of Haiku's context window before the prompt even starts. Here's the MCP server architecture that lets both humans and AI agents analyze financial documents.

Source code: github.com/MinhQuanBuiSco/document-analysis-mcp

The contract analyzer I built last month processes 50-page SaaS agreements in a single LLM call. No vectors, no retrieval, no chunking. The whole document fits in Haiku's 200K context window, and contract review is a classification task — 41 clause types, fixed output space, deterministic pipeline. I explicitly rejected RAG because it would have been overhead pretending to be architecture.

Then I opened a 10-K filing from a Fortune 500 company. 88 pages. Cross-referenced financial tables spanning 30 pages. Risk factors that reference accounting policies that reference footnotes that reference exhibits. A user asks "What was the year-over-year change in operating expenses?" and the answer lives across three sections separated by 40 pages.

Different problem. Different architecture.

This series documents how I built an MCP server for SEC financial filing analysis — a system that parses dense financial PDFs, chunks them semantically, indexes them in a vector database, and exposes seven tools that any AI agent can call through the Model Context Protocol. And because not every consumer is an AI agent, the same tools are also available through a REST API that powers a React frontend.


The Document Analysis MCP Series

| Part | Title | Focus |
| --- | --- | --- |
| 1 | Architecture & The MCP Pattern (this post) | System design, MCP protocol, dual interface, tech stack |
| 2 | PDF Intelligence with Docling | Layout-aware parsing, table extraction, section detection |
| 3 | Semantic Chunking & Vector Search | Chonkie, BGE-M3, LanceDB, why semantic > naive |
| 4 | RAG Q&A & Financial Intelligence | Claude for metrics/summary/Q&A, prompt engineering |
| 5 | FilingScan Frontend & Real-Time Progress | Cyan design system, SSE streaming, financial UI |

The Portfolio Arc: Build, Optimize, Monitor, Apply

This is the sixth project in a portfolio that follows a deliberate progression:

| Series | Project | Question Answered |
| --- | --- | --- |
| Deep Research Agent | Build | How do you build a multi-agent research pipeline? |
| Context Engine | Optimize | How do you reduce token waste by 34%? |
| Agent Observability | Monitor | How do you see what your agent is actually doing? |
| Clinical Research Agent | Healthcare Domain | How do you adapt AI for safety-critical medicine? |
| Contract Analyzer | Legal Domain | How do you apply AI to document-heavy business processes? |
| Document Analysis MCP (this series) | Financial Domain + MCP | How do you build AI tools that other AI agents can use? |

The first five projects all share a common assumption: the user is a human interacting through a UI. This project breaks that assumption. The primary consumer here is another AI agent — Claude Desktop, a coding assistant, a research pipeline — calling tools through MCP. The web frontend is a secondary interface built on top of the same tool functions. That inversion changes everything about how you design the API surface.


Why RAG Is Right Here (And Was Wrong for Contracts)

I need to address this directly because I spent a section of the contract analyzer blog arguing against RAG. That argument was correct for that context. Here's why it does not apply to SEC filings.

The contract analyzer's case against RAG:

  • A 50-page SaaS contract is ~25,000 tokens. Fits in context.
  • Clause classification is a whole-document task. A termination clause on page 42 modifies a liability clause on page 8.
  • Chunking destroys those cross-references. RAG is overhead.

The SEC filing case for RAG:

  • An 88-page 10-K is ~120,000 tokens of dense text plus tables. That's 60% of Claude Sonnet's context window for the document alone — before any system prompt, instructions, or conversation history. Haiku, which I use for metrics extraction and summarization, has the same 200K window but you're burning most of it on raw text.
  • SEC filings are not whole-document tasks. "What was total revenue?" lives in Item 8. "What are the main risks?" lives in Item 1A. "How did R&D spending change?" requires Item 7 (MD&A). These are retrieval tasks — the user has a specific question and the answer lives in a specific region of the document.
  • Financial tables need to be extracted and preserved structurally. You cannot flatten a balance sheet into prose and expect an LLM to reason over it. The tables need layout-aware parsing, and the text around them needs semantic chunking that respects section boundaries.

The decision framework is straightforward: if your document fits in context and the task requires whole-document reasoning, skip RAG. If your document strains the context window and the task is point queries into specific regions, RAG is the right tool. Same engineer, different document, opposite architecture decision.
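As a rough sketch of that rule of thumb, the snippet below encodes the two clear-cut cases. The `DocumentTask` type, the `choose_strategy` name, and the 50% cutoff for "strains the context window" are illustrative assumptions, not part of the project.

```python
from dataclasses import dataclass


@dataclass
class DocumentTask:
    doc_tokens: int             # estimated tokens in the document
    context_window: int         # model context window, in tokens
    needs_whole_document: bool  # True if the task reasons across the whole document


def choose_strategy(task: DocumentTask, max_fraction: float = 0.5) -> str:
    """Return "full_context", "rag", or "depends" for the mixed cases.

    max_fraction is a hypothetical cutoff for how much of the window
    the document may occupy before it counts as straining the context.
    """
    fits = task.doc_tokens / task.context_window <= max_fraction
    if fits and task.needs_whole_document:
        return "full_context"
    if not fits and not task.needs_whole_document:
        return "rag"
    return "depends"


# Token counts taken from the two cases described above.
contract = DocumentTask(doc_tokens=25_000, context_window=200_000, needs_whole_document=True)
filing = DocumentTask(doc_tokens=120_000, context_window=200_000, needs_whole_document=False)

print(choose_strategy(contract))  # full_context
print(choose_strategy(filing))    # rag
```

The mixed cases (a small document with point queries, or a huge document that needs whole-document reasoning) are deliberately left as "depends", since the framework above only speaks to the two extremes.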


What Is MCP and Why It Matters Here

The Model Context Protocol (MCP) is Anthropic's open standard for connecting AI models to external tools and data sources. Instead of every AI application implementing its own bespoke tool-calling interface, MCP defines a protocol: a server exposes tools with typed schemas, a client discovers and calls them.

Here is what that means concretely. When Claude Desktop connects to this MCP server, it sees seven tools:

upload_filing      — Parse and index a SEC filing PDF
list_sections      — Show document structure (items, headings, pages)
search_content     — Semantic search over the filing
extract_tables     — Get tables as markdown + structured data
get_financial_metrics — Extract revenue, EPS, margins into JSON
summarize_section  — Summarize a section or full document
ask_question       — RAG-powered Q&A with source citations

The AI agent reads the tool descriptions, decides which ones to call based on the user's request, and orchestrates multi-step workflows autonomously. A user says "Upload this 10-K and tell me how revenue changed year-over-year" and the agent calls upload_filing, then get_financial_metrics, then ask_question with a focused query — all without explicit programming of that workflow.

This is different from building a REST API for a frontend. A REST API serves a known UI — you design endpoints to match screens. An MCP server serves an unknown agent — you design tools to be composable, self-describing, and independently useful. The tool descriptions become part of the interface contract. They are not documentation; they are the prompt that tells the agent when and how to use each tool.
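To make "self-describing" concrete, here is a minimal sketch of how a tool description can be derived from a function's signature and docstring. This is not FastMCP's actual implementation; `describe_tool` is a hypothetical helper, and the `search_content` stub with its `top_k=5` default is only a stand-in for the real tool.

```python
import inspect
from typing import get_type_hints


def describe_tool(func) -> dict:
    """Build a minimal tool description from a function's signature and docstring."""
    hints = get_type_hints(func)
    params = {}
    for name in inspect.signature(func).parameters:
        annotation = hints.get(name)
        params[name] = annotation.__name__ if annotation is not None else "any"
    return {
        "name": func.__name__,
        "description": inspect.getdoc(func) or "",
        "parameters": params,
    }


def search_content(session_id: str, query: str, top_k: int = 5) -> dict:
    """Semantic search over an uploaded filing. Call upload_filing first to get a session_id."""
    # Stub body for illustration only.
    return {"session_id": session_id, "query": query, "results": []}


tool = describe_tool(search_content)
print(tool["name"])        # search_content
print(tool["parameters"])  # {'session_id': 'str', 'query': 'str', 'top_k': 'int'}
print(tool["description"])
```

The docstring is the only thing the agent has to go on when deciding whether to call the tool, which is why it reads like an instruction rather than an API comment.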

Here is the MCP server definition using FastMCP:

from fastmcp import FastMCP

mcp = FastMCP(
    name="Document Analysis",
    instructions=(
        "This server analyzes SEC financial filings (10-K, 10-Q). "
        "Start by uploading a PDF with upload_filing, then use the session_id "
        "to search, extract tables, summarize sections, or ask questions. "
        "Typical workflow: upload_filing -> list_sections -> search_content or "
        "ask_question -> extract_tables -> get_financial_metrics."
    ),
)

The instructions field acts as a system prompt for any agent connecting to this server. It tells the agent the expected workflow: upload first, then use the session_id. That single string heads off a whole class of errors where the agent calls an analysis tool before it has a valid session_id, errors that would otherwise require retry logic.


Architecture Overview

The system has four layers: an interface layer with two parallel entry points (MCP and REST), a tool layer, a service layer (parsing, chunking, indexing, LLM calls, session state), and a storage layer.

                    ┌──────────────┐     ┌──────────────┐
                    │ Claude Desktop│     │ React Frontend│
                    │  (MCP Client) │     │  (REST Client)│
                    └──────┬───────┘     └──────┬───────┘
                           │                     │
                    MCP Protocol            REST / SSE
                    (stdio/SSE)            (HTTP JSON)
                           │                     │
                           ▼                     ▼
              ┌────────────────────────────────────────────┐
              │              INTERFACE LAYER                │
              │                                            │
              │  ┌─────────────┐    ┌──────────────────┐  │
              │  │  server.py   │    │     api.py        │  │
              │  │  (FastMCP)   │    │   (FastAPI)       │  │
              │  │  7 MCP tools │    │  REST endpoints   │  │
              │  └──────┬──────┘    └────────┬─────────┘  │
              │         │                    │             │
              │         └────────┬───────────┘             │
              │                  │                         │
              │          Same tool functions               │
              └──────────────────┼─────────────────────────┘
                                 │
                                 ▼
              ┌────────────────────────────────────────────┐
              │              TOOL LAYER                     │
              │                                            │
              │  upload.py   search.py   extract.py        │
              │  analyze.py  (pure functions, no I/O       │
              │               awareness)                   │
              └──────────────────┼─────────────────────────┘
                                 │
                                 ▼
              ┌────────────────────────────────────────────┐
              │            SERVICE LAYER                    │
              │                                            │
              │  ┌────────────┐  ┌───────────┐            │
              │  │ pdf_parser  │  │  chunker   │            │
              │  │  (Docling)  │  │ (Chonkie)  │            │
              │  └─────┬──────┘  └─────┬──────┘            │
              │        │               │                   │
              │        ▼               ▼                   │
              │  ┌────────────┐  ┌────────────────┐       │
              │  │  indexer    │  │ bedrock_client  │       │
              │  │ (BGE-M3 +  │  │ (Claude Sonnet  │       │
              │  │  LanceDB)  │  │  + Haiku)       │       │
              │  └────────────┘  └────────────────┘       │
              │                                            │
              │  ┌─────────────────┐                      │
              │  │ session_manager  │                      │
              │  │ (in-memory, TTL) │                      │
              │  └─────────────────┘                      │
              └────────────────────────────────────────────┘
                                 │
                                 ▼
              ┌────────────────────────────────────────────┐
              │            STORAGE LAYER                    │
              │                                            │
              │  ┌────────────┐  ┌──────────────────┐     │
              │  │  LanceDB    │  │  data/uploads/    │     │
              │  │ data/vectors│  │  (PDF files)      │     │
              │  └────────────┘  └──────────────────┘     │
              └────────────────────────────────────────────┘

The critical design decision here is that the tool functions are interface-agnostic. upload_filing(), search_content(), ask_question() — these are pure Python functions that take arguments and return dicts. They have no knowledge of whether they're being called by an MCP server or a REST endpoint. Both server.py and api.py import the same functions from the tools/ package. The MCP server wraps them with @mcp.tool decorators. The REST API wraps them with FastAPI route handlers. Same logic, two surfaces.


The Dual Interface: MCP + REST Bridge

This is the architectural pattern I want to emphasize because it solves a real problem: MCP is for AI agents, but humans still need a UI.

The MCP server (server.py) registers seven tools with FastMCP:

@mcp.tool
def tool_upload_filing(file_path: str) -> dict:
    """Upload and parse a SEC filing PDF.

    Extracts text, sections, and tables using Docling,
    then chunks semantically and indexes with BGE-M3 + LanceDB.
    Returns a session_id for subsequent analysis.
    """
    return upload_filing(file_path)

The REST API (api.py) exposes the same pipeline behind an HTTP endpoint, with one difference: the upload endpoint streams progress events over SSE.

import asyncio
from typing import AsyncGenerator

from fastapi import File, UploadFile
from fastapi.responses import StreamingResponse

# Excerpt: `app`, `queue`, `dest`, and `session_id` are set up in code omitted here.
# The excerpt also omits the part that turns queued events into SSE frames.

@app.post("/api/upload")
async def api_upload(file: UploadFile = File(...)):
    """Upload a PDF filing for analysis.
    Returns an SSE stream with progress events."""

    async def event_stream() -> AsyncGenerator[str, None]:
        # Step 1: Parse PDF (blocking work runs in a worker thread)
        await queue.put({
            "type": "progress",
            "step": "parsing",
            "message": "Parsing PDF with Docling...",
        })
        document = await asyncio.to_thread(parse_pdf, str(dest))

        # Step 2: Chunk
        await queue.put({
            "type": "progress",
            "step": "chunking",
            "message": "Chunking text semantically with Chonkie...",
        })
        chunks = await asyncio.to_thread(chunk_document_sections, ...)

        # Step 3: Embed + Index
        await queue.put({
            "type": "progress",
            "step": "indexing",
            "message": "Embedding with BGE-M3 and indexing in LanceDB...",
        })
        await asyncio.to_thread(create_index, session_id, chunks, ...)

    return StreamingResponse(event_stream(), media_type="text/event-stream")

The MCP path calls upload_filing() synchronously — the agent waits for the result. The REST path calls the same underlying services but wraps them in an SSE stream so the React frontend can show a step-by-step progress timeline. Same processing pipeline, different delivery mechanism. The REST API is not a separate backend — it is a bridge that translates MCP-shaped tool functions into HTTP-shaped endpoints with UX affordances (progress, streaming, file upload) that the MCP protocol does not need.
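The progress-streaming idea can be sketched on its own with nothing but the standard library. This is a simplified illustration, not the project's endpoint: `sse_frame` and `run_with_progress` are hypothetical helpers, the stages are stand-ins that just sleep, and the final `"complete"` event type is an assumption.

```python
import asyncio
import json
import time
from typing import AsyncGenerator, Callable


def sse_frame(event: dict) -> str:
    """Format one event as a Server-Sent Events frame: a data line plus a blank line."""
    return f"data: {json.dumps(event)}\n\n"


async def run_with_progress(
    stages: list[tuple[str, str, Callable[[], object]]],
) -> AsyncGenerator[str, None]:
    """Yield a progress frame before each blocking stage, then a final frame."""
    for step, message, work in stages:
        yield sse_frame({"type": "progress", "step": step, "message": message})
        # Run the blocking stage in a worker thread so the event loop stays responsive.
        await asyncio.to_thread(work)
    yield sse_frame({"type": "complete"})


async def main() -> list[str]:
    # Stand-in stages; the real ones would parse, chunk, and index the PDF.
    stages = [
        ("parsing", "Parsing PDF...", lambda: time.sleep(0.01)),
        ("chunking", "Chunking text...", lambda: time.sleep(0.01)),
        ("indexing", "Embedding and indexing...", lambda: time.sleep(0.01)),
    ]
    return [frame async for frame in run_with_progress(stages)]


frames = asyncio.run(main())
print(len(frames))  # 4
print(frames[0], end="")
```

An async generator like `run_with_progress` is the kind of object a streaming HTTP response can consume directly: each yielded string goes to the client as soon as the preceding stage finishes.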

For every other tool, the REST bridge is a thin wrapper:

@app.post("/api/search")
def api_search(req: SearchRequest):
    return search_content(req.session_id, req.query, req.top_k)

@app.post("/api/ask")
def api_ask(req: AskRequest):
    return ask_question(req.session_id, req.question)

No business logic in the API layer. The API layer is a protocol adapter, nothing more.


The Ingestion Pipeline

When a PDF is uploaded, four stages run sequentially. Each stage has a single responsibility and a clean handoff to the next.

Stage 1: PDF Parsing (Docling)

Docling is IBM's layout-aware document converter. Unlike pdfplumber or PyMuPDF, which work mainly with raw text and character positions, Docling models document structure: headings, paragraphs, tables, lists, page layout. It outputs a structured document object with sections, tables exportable to DataFrames, and a full Markdown representation.

pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = False  # SEC filings are digital PDFs
pipeline_options.do_table_structure = True
pipeline_options.table_structure_options.mode = TableFormerMode.ACCURATE

OCR is off because SEC filings are born-digital. Table structure detection is on and set to ACCURATE mode because financial tables are the highest-value content in a 10-K and getting them wrong means wrong numbers downstream. Blog 2 covers this in depth.

Stage 2: Semantic Chunking (Chonkie)

Raw text is split into semantically coherent chunks using Chonkie's SemanticChunker. The chunker uses a lightweight embedding model (potion-base-32M, 32M parameters) to measure similarity between sentences and groups them based on a 0.5 similarity threshold, targeting 512 tokens per chunk with a minimum of 2 sentences.

_chunker = SemanticChunker(
    embedding_model="minishlab/potion-base-32M",
    threshold=0.5,
    chunk_size=512,
    min_sentences_per_chunk=2,
)

The chunking respects section boundaries: each section is chunked independently, so a chunk from "Risk Factors" never bleeds into "MD&A." If the sections produce too few chunks (fewer than 3), the full document text is chunked as a fallback. Blog 3 covers why semantic chunking outperforms naive fixed-size splitting for financial text.
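The section-boundary rule and its fallback can be sketched independently of Chonkie. In the snippet below, `chunk_by_section`, the `paragraph_chunker` stand-in, and the `"full_document"` label are all illustrative assumptions; the real pipeline would pass the semantic chunker instead.

```python
from typing import Callable

Chunker = Callable[[str], list[str]]


def chunk_by_section(
    sections: dict[str, str],
    full_text: str,
    chunker: Chunker,
    min_chunks: int = 3,
) -> list[dict]:
    """Chunk each section on its own; fall back to the full text if too few chunks result."""
    chunks = [
        {"section": name, "text": piece}
        for name, text in sections.items()
        for piece in chunker(text)
    ]
    if len(chunks) < min_chunks:
        # Too few chunks suggests section detection failed, so chunk the whole document.
        chunks = [{"section": "full_document", "text": piece} for piece in chunker(full_text)]
    return chunks


def paragraph_chunker(text: str) -> list[str]:
    """Stand-in chunker: one chunk per blank-line-separated paragraph."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]


sections = {
    "Risk Factors": "Risk A.\n\nRisk B.",
    "MD&A": "Revenue discussion.\n\nExpense discussion.",
}
full_text = "\n\n".join(sections.values())

chunks = chunk_by_section(sections, full_text, paragraph_chunker)
print([c["section"] for c in chunks])
# ['Risk Factors', 'Risk Factors', 'MD&A', 'MD&A']
```

Because the chunker only ever sees one section's text at a time, no chunk can span two sections, and every chunk carries its section name as metadata for later filtering.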

Stage 3: Embedding (BGE-M3)

Each chunk is embedded using BAAI's BGE-M3 model, producing 1024-dimensional dense vectors. BGE-M3 is a multi-lingual, multi-granularity embedding model that handles both short queries and long passages well — important when your chunks are 512 tokens but your queries are 10 words.

model = SentenceTransformer("BAAI/bge-m3")
embeddings = model.encode(texts, normalize_embeddings=True)

Embeddings are L2-normalized so cosine similarity reduces to dot product — a small optimization that matters when you are searching hundreds of chunks per query.
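A small worked example of that identity, using plain Python lists in place of real 1024-dimensional embeddings (the vectors and helper names here are made up for illustration):

```python
import math


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v: list[float]) -> list[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity computed the long way, with both norms."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


query = [3.0, 4.0, 0.0]  # norm 5
chunk = [1.0, 2.0, 2.0]  # norm 3

full = cosine(query, chunk)
fast = dot(l2_normalize(query), l2_normalize(chunk))

print(round(full, 6))  # 0.733333
print(round(fast, 6))  # 0.733333
```

Normalizing once at index time means each query-time comparison is a single dot product, with no norms to recompute.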

Stage 4: Indexing (LanceDB)

Vectors and metadata are stored in LanceDB, an embedded vector database that runs as a library (no server process). Each uploaded document gets its own table keyed by session ID. The schema stores the vector, text, section name, chunk index, token count, and source filename.

schema = pa.schema([
    pa.field("vector", pa.list_(pa.float32(), 1024)),
    pa.field("text", pa.string()),
    pa.field("section", pa.string()),
    pa.field("chunk_index", pa.int32()),
    pa.field("token_count", pa.int32()),
    pa.field("doc_filename", pa.string()),
])

LanceDB over FAISS, Pinecone, or Chroma: LanceDB is embedded (no separate service), stores data on disk in Lance columnar format (not in memory), supports metadata filtering natively, and has a clean Python API. For a local-first MCP server where each document is an isolated index, it is the right weight class.


Tech Stack

| Component | Technology | Why This One |
| --- | --- | --- |
| MCP Framework | FastMCP 2.0 | Clean decorator API, handles protocol negotiation |
| PDF Parsing | Docling (IBM) | Layout-aware, table structure, section detection |
| Chunking | Chonkie (SemanticChunker) | Sentence-similarity grouping, respects boundaries |
| Embeddings | BGE-M3 (1024-dim) | Multi-granularity, strong on financial text |
| Vector DB | LanceDB | Embedded, columnar, no server process |
| Q&A LLM | Claude Sonnet 4 via Bedrock | Reasoning over financial context with citations |
| Extraction LLM | Claude Haiku 4.5 via Bedrock | Fast structured extraction, low cost |
| Backend | Python 3.12 + FastAPI | SSE streaming, async, multipart upload |
| Frontend | React 19 + TypeScript + Vite | Fast dev loop, modern patterns |
| Styling | Tailwind CSS v4 | Utility-first, theme switching |
| Local Dev | Docker Compose | Two services, hot reload, AWS creds mounted |

Two models for two different tasks. Sonnet 4 handles Q&A, where reasoning quality matters: it needs to synthesize an answer from multiple retrieved passages and cite sources. Haiku 4.5 handles metrics extraction and summarization, where speed and cost matter: it is extracting structured JSON from financial tables, not reasoning about implications.


Session Management and Cleanup

Each uploaded document creates a session: an in-memory object that holds the parsed document, a session ID, and creation and last-access timestamps. Sessions expire after 1 hour of inactivity and are cleaned up by a background thread every 5 minutes.

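import time
from dataclasses import dataclass

# ParsedDocument and settings come from the project's own modules (imports omitted here).

@dataclass
class Session:
    session_id: str
    document: ParsedDocument
    created_at: float
    last_accessed: float

    @property
    def is_expired(self) -> bool:
        # Expiry is measured from the last access, not from creation.
        return time.time() - self.last_accessed > settings.session_timeout_seconds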

When a session is removed, its LanceDB table is also deleted. This keeps disk usage bounded: every upload leaves its own vector table on disk, and without cleanup that storage would keep growing with each new document.

The session ID is the coordination token between all seven tools. Upload returns it. Every subsequent tool requires it. This is the MCP equivalent of a database connection — it scopes all operations to a single document. An AI agent holds it in conversation context. The frontend holds it in React state.
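The expiry-and-cleanup behavior described in this section can be sketched with the standard library alone. This is an illustration under assumptions, not the project's session_manager: `SessionStore`, `touch`, `cleanup`, the `on_evict` hook, and `start_cleanup_thread` are hypothetical names, and the store tracks only last-access times instead of full session objects.

```python
import threading
import time
from typing import Callable, Optional


class SessionStore:
    """Tracks last-access times and evicts sessions idle longer than the timeout."""

    def __init__(
        self,
        timeout_seconds: float = 3600.0,
        on_evict: Optional[Callable[[str], None]] = None,
    ) -> None:
        self.timeout_seconds = timeout_seconds
        self.on_evict = on_evict  # e.g. a hook that drops the session's vector table
        self._last_accessed: dict[str, float] = {}
        self._lock = threading.Lock()

    def touch(self, session_id: str, now: Optional[float] = None) -> None:
        """Create the session or refresh its last-access time."""
        with self._lock:
            self._last_accessed[session_id] = time.time() if now is None else now

    def cleanup(self, now: Optional[float] = None) -> list[str]:
        """Remove expired sessions and return their IDs."""
        now = time.time() if now is None else now
        with self._lock:
            expired = [
                sid for sid, seen in self._last_accessed.items()
                if now - seen > self.timeout_seconds
            ]
            for sid in expired:
                del self._last_accessed[sid]
        # Run eviction hooks outside the lock so slow deletes do not block other callers.
        for sid in expired:
            if self.on_evict is not None:
                self.on_evict(sid)
        return expired


def start_cleanup_thread(store: SessionStore, interval_seconds: float = 300.0) -> threading.Thread:
    """Run store.cleanup() every interval_seconds in a daemon thread."""
    def loop() -> None:
        while True:
            time.sleep(interval_seconds)
            store.cleanup()

    thread = threading.Thread(target=loop, daemon=True)
    thread.start()
    return thread


# Explicit timestamps make the expiry rule easy to follow without waiting an hour.
dropped: list[str] = []
store = SessionStore(timeout_seconds=3600, on_evict=dropped.append)
store.touch("a", now=0)
store.touch("b", now=3000)
print(store.cleanup(now=4000))  # ['a']
print(dropped)                  # ['a']
```

Passing `now` explicitly keeps the expiry logic testable; in a running server the cleanup thread would call `cleanup()` with the real clock.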


Configuration

The full configuration is 12 fields in a single Pydantic model:

from pydantic import BaseModel

class Settings(BaseModel):
    # Embedding and chunking
    embedding_model: str = "BAAI/bge-m3"
    embedding_dim: int = 1024
    chunk_size: int = 512
    similarity_threshold: float = 0.5
    min_sentences: int = 2
    # Storage and sessions
    vector_db_path: str = "data/vectors"
    upload_dir: str = "data/uploads"
    session_timeout_seconds: int = 3600
    # Bedrock models
    bedrock_region: str = "us-east-1"
    summary_model: str = "us.anthropic.claude-haiku-4-5-20251001-v1:0"
    qa_model: str = "us.anthropic.claude-sonnet-4-20250514-v1:0"
    max_tokens: int = 4096

Every parameter has a sensible default. Docker Compose mounts AWS credentials from ~/.aws — no environment variables for secrets, no .env files to manage. The only external dependency is Amazon Bedrock for LLM inference.


Project Structure

document-analysis-mcp/
├── src/
│   ├── server.py            # MCP server — 7 tools via FastMCP
│   ├── api.py               # REST bridge — FastAPI endpoints + SSE
│   ├── config.py            # Pydantic settings, all defaults
│   ├── services/
│   │   ├── pdf_parser.py    # Docling PDF -> ParsedDocument
│   │   ├── chunker.py       # Chonkie semantic chunking
│   │   ├── indexer.py       # BGE-M3 + LanceDB vector index
│   │   ├── bedrock_client.py # Claude Sonnet/Haiku via Converse API
│   │   └── session_manager.py # In-memory sessions with TTL cleanup
│   └── tools/
│       ├── upload.py        # upload_filing()
│       ├── search.py        # list_sections(), search_content()
│       ├── extract.py       # extract_tables(), get_financial_metrics()
│       └── analyze.py       # summarize_section(), ask_question()
├── frontend/                # React 19 + TypeScript + Vite
├── docker-compose.yml       # Backend + Frontend, hot reload
├── Dockerfile
└── pyproject.toml

The layering is intentional. tools/ contains the seven functions that define the public interface for both MCP and REST. services/ contains the infrastructure: PDF parsing, chunking, embedding, LLM calls, session state. Tools call services. Services do not call tools. The interface layer (server.py, api.py) calls tools. The one deliberate exception is the REST upload endpoint, which calls the parsing, chunking, and indexing services directly so it can emit a progress event between stages.


What Is Coming Next

The architecture is in place. The next four posts cover the implementation details that make each layer work:

  • Blog 2 digs into Docling — how layout-aware parsing extracts tables that pdfplumber misses entirely, and why SEC filings specifically need it.
  • Blog 3 covers the chunking and search pipeline — why semantic chunking with Chonkie outperforms naive 512-token splits, how BGE-M3 handles the query-passage length mismatch, and LanceDB's embedded architecture.
  • Blog 4 is about the LLM layer — the RAG prompt engineering for Q&A, the structured extraction prompt for financial metrics, and the model selection reasoning for Sonnet vs. Haiku.
  • Blog 5 covers the FilingScan frontend — the cyan design system, SSE progress streaming, and the UI patterns for displaying financial data.

The pattern to watch across this series: every technical decision traces back to the properties of SEC filings specifically. Dense tables require Docling. Cross-section queries require RAG. Long documents require chunking. Structured financial data requires extraction prompts. The architecture is not a generic document-AI pipeline — it is shaped by the document type it serves.