Agent Observability Series — Blog 1: Architecture & OpenTelemetry Foundations#

Your AI agent is a black box. You know the input and output, but not the 47 LLM calls, 12 tool invocations, and 3 retry loops that happened in between.

I built a Deep Research Agent that decomposes questions, dispatches parallel researchers, critiques, and synthesizes cited reports. Then I built a Context Engine that cuts its token usage by 34%. Both work. But when a query takes 45 seconds instead of 12, or when the answer quality drops from 4.2 to 2.8, I have no idea why. The logs say "research_complete" and nothing else.

In this 5-part series, I'll show you how to build an Agent Observability layer — OpenTelemetry tracing with gen_ai.* semantic conventions, LLM-as-judge automated evaluation, per-request cost tracking, and anomaly detection — all visualized in a real-time React dashboard. This is the third and final series completing the trilogy: Build (Deep Research Agent) -> Optimize (Context Engine) -> Monitor (this project).

The Agent Observability Series#

Part	Title	Focus
1	Architecture & OpenTelemetry Foundations (this post)	System design, gen_ai.* conventions, span model
2	Tracing Multi-Agent Research Pipelines	Hierarchical traces, TraceCollector, dashboard API
3	Automated Evaluation with LLM-as-Judge	5-criteria scoring, quality trends
4	Token Cost Tracking & Budget Alerts	Per-model pricing, stage breakdown, budget alerts
5	Production Dashboard & Anomaly Detection	4-panel dashboard, Z-score anomaly detection

The Trilogy: Build -> Optimize -> Monitor#

This project completes a three-series arc on production AI agents:

Series	Project	Question Answered
Deep Research Agent	Build	How do you build a multi-agent research pipeline?
Context Engine	Optimize	How do you reduce token waste by 34%?
Agent Observability (this series)	Monitor	How do you see what your agent is actually doing?

Each series builds on the previous codebase. The Agent Observability project is a fork of the Context Engine, which was a fork of the Deep Research Agent. Same pipeline, same search tools, same model — but now every operation is traced, evaluated, and costed.

Why Agent Observability Is Different#

Traditional APM (Application Performance Monitoring) tracks HTTP requests, database queries, and error rates. You get latency percentiles, throughput, and 5xx rates. That works because web services are deterministic — the same endpoint does roughly the same work every time.

AI agents break every assumption:

Non-deterministic: The same query can produce different sub-queries, different search results, different synthesis paths. Latency varies 3-10x between runs.
Multi-step reasoning: A single user query triggers a cascade of LLM calls, tool invocations, and conditional branches. A "slow request" might be slow because of one bad search result that triggered 3 retries.
Cost per request varies wildly: A SIMPLE query costs $0.003. A COMPLEX query with 4 parallel researchers costs $0.04. You need per-request cost visibility, not just monthly bills.
Quality is invisible: A 200 OK tells you nothing. The agent might have returned a confident, well-cited answer — or a hallucinated mess. You need automated quality scoring.

Agent observability needs three pillars that traditional APM doesn't provide:

1. Hierarchical Tracing#

Not just "request took 12 seconds" but "decomposition took 800ms, researcher_1 took 4.2s (3 Tavily searches + 1 Bedrock call), researcher_2 took 3.8s, critique took 1.1s, synthesis took 2.1s." Every operation as a span, nested inside its parent.

2. Automated Evaluation#

After every research run, an LLM-as-judge scores the output on accuracy, completeness, citations, coherence, and relevance. One Haiku call, ~$0.001, gives you a quality signal you can track over time.

3. Cost Tracking#

Per-model pricing multiplied by actual token counts gives you per-request cost breakdowns. You can answer "which query type costs the most?" and "are we trending up or down?"

OpenTelemetry gen_ai.* Semantic Conventions#

OpenTelemetry is the industry standard for observability. In 2024, the OTEL community introduced gen_ai. semantic conventions* — a standardized set of span attributes for AI/LLM operations. Instead of every team inventing their own llm.model, ai.tokens, ml.cost attributes, gen_ai.* gives everyone the same vocabulary:

Attribute	Example	Purpose
`gen_ai.request.model`	`claude-haiku-4.5`	Which model was called
`gen_ai.operation.name`	`chat`	Type of operation
`gen_ai.usage.input_tokens`	`1250`	Tokens sent to the model
`gen_ai.usage.output_tokens`	`380`	Tokens generated
`gen_ai.request.temperature`	`0.7`	Sampling temperature
`gen_ai.response.finish_reason`	`stop`	Why generation stopped

We use these conventions throughout. When you look at a span, you immediately know the model, the token counts, and the cost — regardless of which LLM provider you're using.

Architecture Overview#

The observability layer wraps the existing Context Engine pipeline. Every stage emits spans that flow to a local collector, which serves them to the React dashboard:

                          ┌──────────────────────────────────────┐
                          │          OBSERVABILITY LAYER          │
                          │                                       │
  User Query ────────────►│  trace_span("research_pipeline")      │
                          │    ├─ trace_span("decomposition")     │
                          │    ├─ trace_span("researcher_1")      │
                          │    │    ├─ trace_span("tavily_search") │
                          │    │    └─ trace_span("bedrock_call")  │
                          │    ├─ trace_span("researcher_2")      │
                          │    ├─ trace_span("critique")           │
                          │    ├─ trace_span("synthesis")          │
                          │    └─ trace_span("evaluation")         │
                          │                                       │
                          └─────────────┬─────────────────────────┘
                                        │
                                        ▼
                          ┌──────────────────────────────────────┐
                          │       LOCAL TRACE COLLECTOR            │
                          │                                       │
                          │  TraceRecord → JSON file (per trace)   │
                          │  In-memory buffer (100 most recent)    │
                          │  Aggregated stats (p50, p95, avg)      │
                          │                                       │
                          └─────────────┬─────────────────────────┘
                                        │
                                        ▼
                          ┌──────────────────────────────────────┐
                          │         REACT DASHBOARD               │
                          │                                       │
                          │  ┌──────────┐  ┌──────────────────┐   │
                          │  │  Trace   │  │  Quality Chart   │   │
                          │  │ Waterfall│  │  (LLM-as-judge)  │   │
                          │  └──────────┘  └──────────────────┘   │
                          │  ┌──────────┐  ┌──────────────────┐   │
                          │  │   Cost   │  │   Alert Panel    │   │
                          │  │Breakdown │  │  (anomalies)     │   │
                          │  └──────────┘  └──────────────────┘   │
                          │                                       │
                          └───────────────────────────────────────┘

No external infrastructure. No Jaeger. No Datadog. The collector stores traces as JSON files on disk, keeps the 100 most recent in memory, and serves them via REST API to the dashboard. For a local development tool, this is exactly the right trade-off — zero setup, zero cost, instant visibility.

The SpanData Class#

The foundation of the tracing system is SpanData — a lightweight dataclass that represents a single operation in the trace:

@dataclass
class SpanData:
    """Lightweight span representation (no OTEL dependency for local dev)."""
    span_id: str = ""
    parent_id: str = ""
    trace_id: str = ""
    name: str = ""
    start_time: float = 0.0
    end_time: float = 0.0
    duration_ms: float = 0.0
    status: str = "ok"
    attributes: dict = field(default_factory=dict)
    events: list[dict] = field(default_factory=list)

    def to_dict(self) -> dict:
        return asdict(self)

A few design decisions worth noting:

No full OTEL SDK dependency. The full OpenTelemetry Python SDK is heavy — it pulls in grpc, protobuf, and a dozen transitive dependencies. For local development, we don't need OTLP exporters or remote collectors. We follow the gen_ai.* conventions (attribute names) without the SDK overhead.
parent_id for hierarchy. Every span knows its parent, which lets the dashboard render a nested waterfall view. When researcher_1 makes a tavily_search call, the search span's parent_id points to the researcher span.
attributes is a free-form dict. This is where gen_ai.* attributes go: gen_ai.request.model, gen_ai.usage.input_tokens, etc. No rigid schema — you add what you need per span type.
events for exceptions. When a span errors, it captures the exception message as an event, following OTEL's event model.

The trace_span() Context Manager#

The core instrumentation primitive is trace_span() — a context manager that automatically tracks timing, parent-child relationships, and error status:

# Global trace context
_current_trace_id: str = ""
_current_spans: list[SpanData] = []
_span_stack: list[str] = []  # Stack of parent span IDs


@contextmanager
def trace_span(name: str, attributes: dict | None = None):
    """Create a span that tracks timing and attributes.

    Uses gen_ai.* semantic conventions for AI-specific attributes.

    Example:
        with trace_span("bedrock_call", {
            "gen_ai.request.model": "claude-haiku-4.5",
            "gen_ai.operation.name": "chat",
        }) as span:
            result = await model.invoke(prompt)
            span.attributes["gen_ai.usage.input_tokens"] = result.input_tokens
            span.attributes["gen_ai.usage.output_tokens"] = result.output_tokens
    """
    global _current_trace_id

    if not _current_trace_id:
        _current_trace_id = str(uuid4())[:16]

    span = SpanData(
        span_id=str(uuid4())[:12],
        parent_id=_span_stack[-1] if _span_stack else "",
        trace_id=_current_trace_id,
        name=name,
        start_time=time.time(),
        attributes=attributes or {},
    )

    _span_stack.append(span.span_id)

    try:
        yield span
    except Exception as e:
        span.status = "error"
        span.events.append({"name": "exception", "message": str(e)})
        raise
    finally:
        span.end_time = time.time()
        span.duration_ms = round((span.end_time - span.start_time) * 1000, 2)
        _span_stack.pop()
        _current_spans.append(span)

The _span_stack is the key insight. It's a simple list that acts as a stack — when you enter a span, its ID gets pushed onto the stack. Any nested span created inside it automatically gets the top of the stack as its parent_id. When the span exits, it pops itself off. This gives you hierarchical traces without any explicit parent passing:

with trace_span("research_pipeline"):           # parent_id = ""
    with trace_span("researcher_1"):             # parent_id = research_pipeline
        with trace_span("tavily_search"):        # parent_id = researcher_1
            ...
        with trace_span("bedrock_call"):         # parent_id = researcher_1
            ...
    with trace_span("critique"):                 # parent_id = research_pipeline
        ...

Instrumenting the Research Pipeline#

Here's how app.py wraps the research pipeline with trace spans. The key section is in the WebSocket handler:

if should_research:
    trace_id = new_trace()
    research_start = time.time()
    context_engine = ContextEngine()

    with trace_span("research_pipeline", {
        "gen_ai.request.model": settings.bedrock_model_id,
        "query": message[:100],
        "complexity": complexity.level.value,
    }):
        async for event in run_research(message, conv_id, context_engine=context_engine):
            if event.get("type") == "text":
                assistant_response += event.get("data", "")
            elif event.get("type") == "sources":
                assistant_sources = event.get("sources", [])
            await send_event(event)

    # Post-research observability
    research_duration = (time.time() - research_start) * 1000
    spans = get_current_spans()

    # Evaluate quality with LLM-as-judge
    eval_result = await evaluate_research(message, assistant_response, assistant_sources)
    evaluation = eval_result.to_dict()

    # Track cost
    cost_tracker = get_cost_tracker()
    cost_tracker.start_request()
    for stage in context_engine.stages:
        if stage.tokens > 0:
            input_t = int(stage.tokens * 0.75)
            output_t = stage.tokens - input_t
            cost_tracker.record_call(stage.name, settings.bedrock_model_id, input_t, output_t)
    request_cost = cost_tracker.finish_request(query=message)

    # Store trace
    collector = get_collector()
    collector.store_trace(TraceRecord(
        trace_id=trace_id,
        query=message,
        start_time=datetime.fromtimestamp(research_start, tz=timezone.utc).isoformat(),
        end_time=datetime.now(timezone.utc).isoformat(),
        duration_ms=round(research_duration, 1),
        total_tokens=engineered_tokens,
        total_cost=request_cost.get("total_cost", 0),
        spans=[s.to_dict() for s in spans],
        evaluation=evaluation,
        complexity=complexity.level.value,
    ))

Every research request now flows through this pipeline: trace the execution, evaluate the output, cost the tokens, store the record, and check for anomalies. Five operations that add ~200ms to a 12-second research pipeline — a 1.6% overhead for complete visibility.

Live Demo#

Agent Observability Landing The Agent Observability interface — same research pipeline, new dashboard tab for traces, quality scores, cost breakdowns, and anomaly alerts

Agent Observability Dashboard The observability dashboard: trace waterfall (top-left), quality chart (top-right), cost breakdown (bottom-left), and alert panel (bottom-right)

Tech Stack#

Layer	Technology	Why
AI Model	Claude Haiku 4.5 via Amazon Bedrock	Fast, cost-effective for parallel agents + judge
Agent SDK	Strands Agents SDK	Clean `@tool` decorators, async support, Bedrock native
Backend	Python 3.12 + FastAPI	Async-first, WebSocket support
Frontend	React 19 + TypeScript + Vite	Modern, fast HMR
Tracing	OpenTelemetry gen_ai.* conventions	Industry standard span attributes
Evaluation	LLM-as-judge (Haiku)	Automated quality scoring, 5 criteria
Cost Tracking	Per-model pricing tables	Per-request cost breakdowns
Anomaly Detection	Z-score on latency/cost/quality	Flag 3-sigma outliers
Search	Tavily API	Purpose-built for AI search
Storage	Local JSON files	No external DB, zero setup
Dev Environment	Docker Compose	Backend + frontend with hot reload

The key addition versus the Context Engine is the observability/ package — 6 Python modules that implement tracing, evaluation, cost tracking, anomaly detection, a trace collector, and a dashboard API. No new infrastructure. No Jaeger. No Prometheus. Just Python code and JSON files.

Project Structure#

agent_observability/
├── backend/
│   ├── observability/              # NEW — Observability layer
│   │   ├── tracer.py               # OTEL setup + span helpers
│   │   ├── instrumentor.py         # Auto-instrument Bedrock/Tavily
│   │   ├── evaluator.py            # LLM-as-judge evaluation
│   │   ├── cost_tracker.py         # Token cost calculation
│   │   ├── anomaly_detector.py     # Z-score anomaly detection
│   │   ├── collector.py            # Local trace/metric storage
│   │   └── dashboard_api.py        # REST API for dashboard
│   ├── context/                    # Context engine (inherited)
│   ├── agents/                     # Orchestrator (instrumented)
│   ├── tools/                      # Tools (instrumented)
│   └── app.py                      # FastAPI + WebSocket + dashboard
├── frontend/
│   ├── src/
│   │   ├── components/
│   │   │   ├── dashboard/          # NEW — Dashboard components
│   │   │   │   ├── TraceWaterfall.tsx
│   │   │   │   ├── QualityChart.tsx
│   │   │   │   ├── CostBreakdown.tsx
│   │   │   │   ├── AlertPanel.tsx
│   │   │   │   └── DashboardPage.tsx
│   │   │   └── ...                 # Existing components
│   │   └── ...
├── data/
│   ├── memory.json                 # Context engine memory
│   ├── traces/                     # Trace storage (JSON per trace)
│   └── evaluations/                # Evaluation scores
└── docker-compose.yml

The observability layer sits alongside the context engine — both wrap the same research pipeline, but at different levels. The context engine optimizes what goes into the LLM. The observability layer measures what comes out.

Running It Yourself#

Prerequisites#

Python 3.12+ and Node.js 22+
Docker (for local dev)
AWS CLI with a named profile that has Bedrock access
Tavily API key — Free at tavily.com

Quick Start#

git clone https://github.com/MinhQuanBuiSco/agent-observability.git
cd agent-observability

# Create .env with your Tavily key
echo "TAVILY_API_KEY=tvly-your-key" > .env

# Start both services
docker compose up --build

# Open http://localhost:5173
# Dashboard at http://localhost:5173/dashboard

AWS Profile: The docker-compose.yml uses AWS_PROFILE=dev by default. Replace dev with your own profile name, or set the environment variable before running: AWS_PROFILE=your-profile-name docker compose up --build

What's Next#

In Blog 2: Tracing Multi-Agent Research Pipelines, we'll dive into the TraceCollector — how traces are stored as JSON files, loaded from disk on startup, and served via REST API. We'll walk through the hierarchical trace structure (session -> research -> sub-query -> LLM call -> tool use), the aggregated stats computation (p50, p95, avg duration), and the seeding pattern that gives the anomaly detector a baseline from historical traces.

This project is a fork of the Context Engine, which is a fork of the Deep Research Agent — read those series first for the base pipeline.

All code is open source: github.com/MinhQuanBuiSco/agent-observability