Agent Observability Series — Blog 3: Automated Evaluation with LLM-as-Judge#
Your agent returned an answer. Is it good? Accurate? Complete? Are the sources credible? LLM-as-judge evaluates every response automatically — 500x cheaper than human review.
In Blog 1 we set up OpenTelemetry tracing, and in Blog 2 we built the TraceCollector to persist and query traces. But tracing only tells you how long things took and how many tokens were consumed. It doesn't tell you whether the output was any good.
A research pipeline can run perfectly (all spans green, latency in its normal range, cost on budget) and still produce a mediocre answer. Maybe the search results were thin. Maybe the synthesis hallucinated a statistic. Maybe the citations are dead links. You need automated quality scoring that runs after every pipeline run, scores the output on multiple criteria, and tracks trends over time.
The Agent Observability Series#
| Part | Title | Focus |
|---|---|---|
| 1 | Architecture & OpenTelemetry Foundations | System design, gen_ai.* conventions, span model |
| 2 | Tracing Multi-Agent Research Pipelines | Hierarchical traces, TraceCollector, dashboard API |
| 3 | Automated Evaluation with LLM-as-Judge (this post) | 5-criteria scoring, quality trends |
| 4 | Token Cost Tracking & Budget Alerts | Per-model pricing, stage breakdown, budget alerts |
| 5 | Production Dashboard & Anomaly Detection | 4-panel dashboard, Z-score anomaly detection |
What Is LLM-as-Judge?#
LLM-as-judge is a pattern where you use one LLM to evaluate the output of another. The idea is simple: take your agent's output, send it to a "judge" model with a scoring rubric, and get structured scores back. It's not perfect — the judge model has its own biases and blind spots — but it's 500x cheaper than human review and runs in 1-2 seconds.
The key insight: you don't need a powerful model for judging. A research pipeline might use Claude Haiku 4.5 for synthesis, and the same Haiku model works fine as a judge. Evaluation is easier than generation: the judge only needs to read and score, not create from scratch. One Haiku call costs a fraction of a cent (see the cost breakdown later in this post), so evaluating every single research run adds negligible cost.
When is LLM-as-judge useful?
- Development: Track quality across code changes. Did your new prompt template improve accuracy?
- Monitoring: Detect quality drops in production. If average scores fall from 3.8 to 2.5, something is wrong.
- Regression testing: Run the same queries after changes and compare scores.
- Cost-quality tradeoff: Correlate quality scores with token counts. Are you overpaying for marginal quality gains?
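As a rough illustration of the regression-testing use case, the sketch below compares overall scores for the same queries before and after a change. The function name `score_regressions`, the 0.5 tolerance, and the "candidate" scores are hypothetical and not part of the project.

```python
def score_regressions(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.5,
) -> dict[str, float]:
    """Return score deltas for queries that dropped by more than `tolerance`."""
    return {
        query: round(candidate[query] - baseline[query], 2)
        for query in baseline
        if query in candidate and baseline[query] - candidate[query] > tolerance
    }


# Hypothetical overall scores for the same queries before and after a prompt change.
baseline = {"Compare top 3 Python web frameworks for APIs": 3.8, "What is Python?": 3.4}
candidate = {"Compare top 3 Python web frameworks for APIs": 4.0, "What is Python?": 2.6}

print(score_regressions(baseline, candidate))  # {'What is Python?': -0.8}
```

Only drops larger than the tolerance are reported, which keeps ordinary run-to-run judge noise from showing up as a regression.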
When is it not enough?
- Factual verification: The judge can't independently verify whether a cited statistic is correct. It can check if claims have citations, but not if those citations are accurate.
- Domain expertise: For medical, legal, or financial research, you need human experts in the loop.
- Adversarial inputs: If the agent is being deliberately misled, the judge may be fooled by the same inputs.
For our use case — monitoring a research agent during development — LLM-as-judge hits the sweet spot of cost, speed, and signal quality.
The Judge Prompt#
The judge prompt defines 5 criteria, each scored 1-5, with structured JSON output:
```python
JUDGE_PROMPT = """\
You are a research quality evaluator. Score the following research report on 5 criteria.

For each criterion, provide a score from 1-5 and a brief explanation (1 sentence).

Criteria:
1. **Accuracy**: Are the facts correct and well-supported by sources?
2. **Completeness**: Does the report cover all aspects of the question?
3. **Citations**: Are sources properly cited with URLs? Are they credible?
4. **Coherence**: Is the report well-structured and easy to follow?
5. **Relevance**: Does the report actually answer the question asked?

Output ONLY valid JSON in this format:
{
    "accuracy": {"score": 4, "reason": "..."},
    "completeness": {"score": 3, "reason": "..."},
    "citations": {"score": 5, "reason": "..."},
    "coherence": {"score": 4, "reason": "..."},
    "relevance": {"score": 5, "reason": "..."},
    "overall_score": 4.2,
    "summary": "Brief overall assessment in 1-2 sentences"
}
"""
```
A few design choices:
- 5 criteria, not 1. A single "quality score" hides too much. A report with perfect accuracy but zero citations could still earn a high single score while hiding a real problem. Separate dimensions let you spot specific failure modes.
- 1-5 scale. Enough granularity to track trends, not so much that the judge agonizes over the difference between a 7 and 8 on a 10-point scale.
- "reason" field. Forces the judge to justify its score, which improves scoring consistency (chain-of-thought for evaluation).
- JSON-only output. No preamble, no explanation outside the JSON. This makes parsing reliable: we just find the first { and the last }.
- overall_score computed by the judge. Rather than averaging the 5 criteria ourselves, we let the judge compute a weighted overall score. The judge has context about which criteria matter more for this specific report.
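Because the overall score comes from the judge rather than from arithmetic, it can be useful to see how far it drifts from a plain average. The following is a minimal sketch, assuming the JSON shape requested by the judge prompt; the helper names are hypothetical and not part of the project.

```python
CRITERIA = ("accuracy", "completeness", "citations", "coherence", "relevance")


def mean_criterion_score(scores: dict) -> float:
    """Unweighted mean of the five per-criterion scores."""
    return sum(scores[name]["score"] for name in CRITERIA) / len(CRITERIA)


def overall_score_gap(scores: dict) -> float:
    """Absolute gap between the judge's overall_score and the plain mean."""
    return abs(scores["overall_score"] - mean_criterion_score(scores))


# Example payload in the shape the judge prompt asks for.
example = {
    "accuracy": {"score": 5, "reason": "..."},
    "completeness": {"score": 4, "reason": "..."},
    "citations": {"score": 3, "reason": "..."},
    "coherence": {"score": 5, "reason": "..."},
    "relevance": {"score": 5, "reason": "..."},
    "overall_score": 4.2,
    "summary": "...",
}

print(round(mean_criterion_score(example), 2))  # 4.4
print(round(overall_score_gap(example), 2))     # 0.2
```

A small gap is what you would expect from judge-side weighting. A large one may mean the judge's overall number is unreliable for that run.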
The EvaluationResult Dataclass#
The evaluation result mirrors the JSON structure from the judge:
```python
from dataclasses import dataclass, field


@dataclass
class EvaluationResult:
    """Result from LLM-as-judge evaluation."""

    # Each criterion holds {"score": int, "reason": str}
    accuracy: dict = field(default_factory=dict)
    completeness: dict = field(default_factory=dict)
    citations: dict = field(default_factory=dict)
    coherence: dict = field(default_factory=dict)
    relevance: dict = field(default_factory=dict)
    overall_score: float = 0.0
    summary: str = ""
    evaluated_at: str = ""
    cost_tokens: int = 0

    def to_dict(self) -> dict:
        return {
            "accuracy": self.accuracy,
            "completeness": self.completeness,
            "citations": self.citations,
            "coherence": self.coherence,
            "relevance": self.relevance,
            "overall_score": self.overall_score,
            "summary": self.summary,
            "evaluated_at": self.evaluated_at,
            "cost_tokens": self.cost_tokens,
        }
```
Each criterion field stores {"score": 4, "reason": "..."}. The cost_tokens field tracks how many tokens the evaluation itself consumed — meta-observability for the observability layer.
The evaluate_research() Function#
This is the core function. It takes a query, a research report, and the sources used, sends them to the judge model, and parses the structured scores:
```python
import json
from datetime import datetime, timezone

# Agent, BedrockModel, settings, logger, JUDGE_PROMPT, and EvaluationResult
# are assumed to be imported or defined elsewhere in this module.


async def evaluate_research(query: str, report: str, sources: list[dict]) -> EvaluationResult:
    """Evaluate a research report using LLM-as-judge.

    Sends the query, report, and sources to a judge model that scores
    accuracy, completeness, citations, coherence, and relevance.

    Cost: one Haiku call per evaluation.
    """
    result = EvaluationResult(evaluated_at=datetime.now(timezone.utc).isoformat())

    try:
        model = BedrockModel(
            model_id=settings.bedrock_model_id,
            region_name=settings.aws_region,
        )
        judge = Agent(
            model=model,
            system_prompt=JUDGE_PROMPT,
            tools=[],
            callback_handler=None,
        )

        source_list = "\n".join(
            f"- [{s.get('title', s.get('domain', 'Unknown'))}]({s.get('url', '')})"
            for s in sources[:8]
        )

        eval_prompt = (
            f"Evaluate this research report:\n\n"
            f"**Original Question**: {query}\n\n"
            f"**Report**:\n{report[:8000]}\n\n"
            f"**Sources Used**:\n{source_list}\n\n"
            f"Score each criterion 1-5 and output JSON only."
        )

        judge_result = await judge.invoke_async(eval_prompt)
        response_text = str(judge_result)

        # Parse JSON from response
        start = response_text.index("{")
        end = response_text.rindex("}") + 1
        scores = json.loads(response_text[start:end])

        result.accuracy = scores.get("accuracy", {})
        result.completeness = scores.get("completeness", {})
        result.citations = scores.get("citations", {})
        result.coherence = scores.get("coherence", {})
        result.relevance = scores.get("relevance", {})
        result.overall_score = scores.get("overall_score", 0.0)
        result.summary = scores.get("summary", "")

        # Estimate token cost
        from context.budget import count_tokens
        result.cost_tokens = count_tokens(eval_prompt) + count_tokens(response_text)

    except Exception as e:
        logger.warning("evaluation_failed", error=str(e))
        result.summary = f"Evaluation failed: {str(e)}"

    return result
```
Key implementation details:
The Judge Agent Has No Tools#
```python
judge = Agent(
    model=model,
    system_prompt=JUDGE_PROMPT,
    tools=[],               # No tools: pure evaluation
    callback_handler=None,  # No streaming: we just want the result
)
```
The judge doesn't need web search, credibility scoring, or any other tool. It receives the report and sources as input and produces a JSON score. No tools means no tool descriptions in the context window (saving ~200 tokens) and no risk of the judge deciding to "verify" something by making additional calls.
Report Truncation#
f"**Report**:\n{report[:8000]}\n\n"
Reports are capped at 8,000 characters. A typical research report is 2,000-5,000 characters, so this rarely triggers. But for edge cases where synthesis produces a very long report, truncation prevents the evaluation from costing more than the research itself.
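A plain slice can cut the report mid-sentence, and the judge has no way to know the text was shortened. One possible refinement, shown here as a sketch rather than the project's implementation, cuts at the last paragraph break and appends a marker. The function name and the marker text are assumptions.

```python
def truncate_report(report: str, limit: int = 8000) -> str:
    """Keep at most `limit` characters of the report, then append a short marker.

    Cuts at the last paragraph break inside the limit when one exists.
    """
    if len(report) <= limit:
        return report
    head = report[:limit]
    cut = head.rfind("\n\n")
    if cut > 0:
        head = head[:cut]
    return head.rstrip() + "\n\n[Report truncated for evaluation]"


long_report = "a" * 5000 + "\n\n" + "b" * 5000
short = truncate_report(long_report)
print(len(short), short.endswith("[Report truncated for evaluation]"))
```

The marker gives the judge a reason not to penalize completeness for text it never saw, at the cost of a few extra tokens.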
Source Limiting#
for s in sources[:8]
Only the first 8 sources are included. More sources mean more tokens, and the judge only needs to see enough to assess citation quality — not every single source.
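To make the source formatting concrete, here is the same expression from evaluate_research() pulled into a standalone helper and run on a few hypothetical sources. The helper name `format_sources` and the example URLs are illustrative only.

```python
def format_sources(sources: list[dict], limit: int = 8) -> str:
    """Render up to `limit` sources as a markdown bullet list of links."""
    return "\n".join(
        f"- [{s.get('title', s.get('domain', 'Unknown'))}]({s.get('url', '')})"
        for s in sources[:limit]
    )


# Hypothetical sources: one with a title, one with only a domain, one empty.
sources = [
    {"title": "Framework docs", "url": "https://example.com/docs"},
    {"domain": "example.org", "url": "https://example.org/post"},
    {},
]

print(format_sources(sources))
# - [Framework docs](https://example.com/docs)
# - [example.org](https://example.org/post)
# - [Unknown]()
```

The label falls back from title to domain to "Unknown", and a source without a URL produces an empty link, which the judge can reasonably count against citation quality.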
JSON Parsing#
```python
start = response_text.index("{")
end = response_text.rindex("}") + 1
scores = json.loads(response_text[start:end])
```
Despite the "output ONLY valid JSON" instruction, models sometimes add a brief preamble or trailing comment. The index("{")/rindex("}") pattern extracts the JSON block regardless of surrounding text. This is more robust than strict parsing.
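Extracting the JSON block does not guarantee that the scores inside it are usable. A possible extension, sketched below under the assumption that the judge returns the shape requested in the prompt, also checks that each criterion is present with a score between 1 and 5. The helper `parse_judge_scores` is hypothetical.

```python
import json

CRITERIA = ("accuracy", "completeness", "citations", "coherence", "relevance")


def parse_judge_scores(response_text: str) -> dict:
    """Extract the judge's JSON object and check each criterion score is 1-5."""
    start = response_text.index("{")      # raises ValueError if there is no "{"
    end = response_text.rindex("}") + 1
    scores = json.loads(response_text[start:end])

    for name in CRITERIA:
        entry = scores.get(name)
        if not isinstance(entry, dict):
            raise ValueError(f"missing criterion: {name}")
        score = entry.get("score")
        is_number = isinstance(score, (int, float)) and not isinstance(score, bool)
        if not is_number or not 1 <= score <= 5:
            raise ValueError(f"invalid score for {name}: {score!r}")
    return scores


good = {name: {"score": 4, "reason": "ok"} for name in CRITERIA}
good["overall_score"] = 4.0
wrapped = "Here is my evaluation:\n" + json.dumps(good) + "\nHope this helps."
print(parse_judge_scores(wrapped)["accuracy"]["score"])  # 4
```

Every failure path raises ValueError (json.JSONDecodeError is a subclass), so a caller with a broad except block, like the one in evaluate_research(), would treat a malformed response as a failed evaluation.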
Graceful Failure#
```python
except Exception as e:
    logger.warning("evaluation_failed", error=str(e))
    result.summary = f"Evaluation failed: {str(e)}"
```
If the judge call fails — network error, JSON parse error, rate limiting — the evaluation returns a result with a zero score and an error message. The research pipeline continues normally. Evaluation is observability, not a gate — a failed evaluation should never block the user's research.
Integration in app.py#
The evaluation runs immediately after the research pipeline completes, as part of the "post-research observability" block:
```python
# Evaluate quality with LLM-as-judge
evaluation = {}
if assistant_response:
    try:
        eval_result = await evaluate_research(message, assistant_response, assistant_sources)
        evaluation = eval_result.to_dict()
        await send_event({"type": "evaluation", "data": evaluation})
    except Exception as e:
        logger.warning("evaluation_skipped", error=str(e))
```
The evaluation result is:
- Sent to the frontend via WebSocket as an evaluation event, so the UI can show the score in real time.
- Stored in the trace as part of the TraceRecord, so the dashboard can display quality trends.
- Fed to the anomaly detector, which checks if the quality score deviates significantly from the running average.
This three-way integration means quality data flows everywhere it's needed — the user sees it immediately, the dashboard tracks it historically, and the alerting system watches for regressions.
What Real Evaluations Look Like#
Here are evaluation scores from actual research runs during development:
| Query | Accuracy | Completeness | Citations | Coherence | Relevance | Overall |
|---|---|---|---|---|---|---|
| "Compare top 3 Python web frameworks for APIs" | 4 | 3 | 4 | 4 | 4 | 3.8 |
| "What is the current state of quantum computing?" | 4 | 4 | 3 | 4 | 5 | 4.0 |
| "Explain transformer architecture in detail" | 5 | 4 | 3 | 5 | 5 | 4.2 |
| "Latest developments in autonomous vehicles" | 3 | 3 | 2 | 4 | 4 | 3.2 |
| "What is Python?" (SIMPLE query) | 4 | 2 | 2 | 4 | 5 | 3.4 |
| "Pros and cons of microservices vs monoliths" | 4 | 4 | 4 | 4 | 5 | 4.2 |
Patterns that emerge:
- Citations are consistently the weakest criterion (avg 3.0). The research pipeline finds sources but doesn't always format them as inline citations in the report. This is actionable — we can improve the synthesis prompt to emphasize citation formatting.
- Relevance is consistently the strongest (avg 4.7). The pipeline stays on topic well, even for complex comparative queries.
- SIMPLE queries get lower completeness scores (2-3). By design — the context engine allocates fewer sub-queries and search results for simple queries. The evaluation confirms the trade-off: we're saving tokens at the cost of thoroughness for easy questions.
- Scores cluster in the 3.2-4.2 range. This is expected: in practice the judge treats 3 as "adequate" and 4 as "good," even though the prompt does not spell out these anchors. Scores below 2.5 or above 4.5 are rare and worth investigating.
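The per-criterion averages quoted above can be reproduced directly from the table. This short sketch copies the six rows by hand; in the real system the values would come from stored traces.

```python
# Per-criterion scores copied from the table above, in row order.
runs = [
    {"accuracy": 4, "completeness": 3, "citations": 4, "coherence": 4, "relevance": 4},
    {"accuracy": 4, "completeness": 4, "citations": 3, "coherence": 4, "relevance": 5},
    {"accuracy": 5, "completeness": 4, "citations": 3, "coherence": 5, "relevance": 5},
    {"accuracy": 3, "completeness": 3, "citations": 2, "coherence": 4, "relevance": 4},
    {"accuracy": 4, "completeness": 2, "citations": 2, "coherence": 4, "relevance": 5},
    {"accuracy": 4, "completeness": 4, "citations": 4, "coherence": 4, "relevance": 5},
]

# Average each criterion across runs, rounded to one decimal place.
averages = {
    name: round(sum(run[name] for run in runs) / len(runs), 1)
    for name in runs[0]
}
weakest = min(averages, key=averages.get)
strongest = max(averages, key=averages.get)

print(averages)
print(weakest, strongest)  # citations relevance
```

With only six runs these averages are a rough signal rather than a statistically meaningful one, but the same calculation applies unchanged as more traces accumulate.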
Cost of Evaluation#
Each evaluation costs one Haiku call. Here's the math:
| Component | Tokens | Cost |
|---|---|---|
| System prompt (judge instructions) | ~180 | — |
| Eval prompt (query + report + sources) | ~1,200 | $0.00006 (input) |
| Judge response (JSON scores) | ~150 | $0.000019 (output) |
| Total per evaluation | ~1,530 | ~$0.0001 |
At $0.0001 per evaluation, you can evaluate 10,000 research runs for $1. Compare that to human review at $0.05-0.50 per evaluation — LLM-as-judge is 500-5,000x cheaper.
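The per-evaluation figure scales linearly with whatever per-token rates your model is billed at, so it is worth recomputing with current prices. The sketch below is a generic calculator, not the project's cost tracker. The per-million-token rates in the example are placeholders chosen for illustration, not the prices behind the table, and the system prompt is counted as input here even though the table leaves it unpriced.

```python
def evaluation_cost_usd(
    input_tokens: int,
    output_tokens: int,
    input_price_per_million: float,
    output_price_per_million: float,
) -> float:
    """Dollar cost of one judge call from token counts and per-million-token prices."""
    return (
        input_tokens * input_price_per_million
        + output_tokens * output_price_per_million
    ) / 1_000_000


# Token counts from the table above: system prompt + eval prompt in, JSON scores out.
input_tokens = 180 + 1_200
output_tokens = 150

# Placeholder rates for illustration only; substitute your provider's published prices.
cost = evaluation_cost_usd(
    input_tokens,
    output_tokens,
    input_price_per_million=1.00,
    output_price_per_million=5.00,
)
print(f"${cost:.5f} per evaluation, about {1 / cost:,.0f} evaluations per dollar")
```

If the result differs from the table by a large factor, the rates are the first thing to check, since the token counts change little between runs.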
The evaluation adds ~1.5 seconds to each research pipeline (one Haiku call). For a pipeline that takes 10-20 seconds, this is a 7-15% overhead. If that's too much, you can make evaluation async (fire-and-forget) or sample (evaluate every Nth request). But for development and testing, evaluating every request gives you the highest signal.
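Both mitigations mentioned above, sampling and fire-and-forget, can be sketched with the standard library. This is one possible shape under stated assumptions: `fake_evaluate` is a hypothetical stand-in for evaluate_research(), and the helper names are not from the project.

```python
import asyncio
import itertools

_request_counter = itertools.count(1)
_background_tasks: set = set()


def should_evaluate(sample_every: int) -> bool:
    """Return True on every Nth call, where N is `sample_every`."""
    return next(_request_counter) % sample_every == 0


def schedule_evaluation(coro) -> asyncio.Task:
    """Run an evaluation coroutine in the background without awaiting it."""
    task = asyncio.create_task(coro)
    _background_tasks.add(task)  # hold a reference until the task finishes
    task.add_done_callback(_background_tasks.discard)
    return task


async def fake_evaluate(query: str) -> float:
    """Hypothetical stand-in for evaluate_research()."""
    await asyncio.sleep(0)
    return 4.0


async def main() -> list:
    tasks = []
    for i in range(1, 7):
        if should_evaluate(sample_every=3):  # requests 3 and 6 are sampled
            tasks.append(schedule_evaluation(fake_evaluate(f"query {i}")))
    # The demo waits so it can show results; a real handler would return immediately.
    return await asyncio.gather(*tasks)


results = asyncio.run(main())
print(results)  # [4.0, 4.0]
```

Keeping a reference to each task matters because the event loop only holds weak references to tasks. Since evaluate_research() catches its own exceptions, a background failure would surface as a logged warning rather than an unhandled task error.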
Quality Trend Tracking#
The dashboard's /api/dashboard/evaluations endpoint returns evaluation scores across traces, enabling trend visualization:
```python
@router.get("/evaluations")
async def get_evaluations(limit: int = 20):
    """Get recent evaluation scores from traces."""
    collector = get_collector()
    traces = collector._recent_traces[:limit]
    evaluations = []
    for t in traces:
        if t.evaluation and t.evaluation.get("overall_score"):
            evaluations.append({
                "trace_id": t.trace_id,
                "query": t.query[:80],
                "overall_score": t.evaluation.get("overall_score"),
                "accuracy": t.evaluation.get("accuracy", {}).get("score"),
                "completeness": t.evaluation.get("completeness", {}).get("score"),
                "citations": t.evaluation.get("citations", {}).get("score"),
                "coherence": t.evaluation.get("coherence", {}).get("score"),
                "relevance": t.evaluation.get("relevance", {}).get("score"),
                "evaluated_at": t.evaluation.get("evaluated_at", ""),
                "timestamp": t.start_time,
            })
    return {"evaluations": evaluations}
```
The frontend QualityChart component plots these scores over time as a line chart. You can see at a glance whether quality is trending up (your prompt changes are working) or down (something regressed). Each data point is clickable — tap a score to jump to the full trace and see exactly what the agent did.
The anomaly detector also watches quality scores. If a research run scores 2.0 when the running average is 3.8, that's a quality anomaly — the alert panel flags it immediately, with the query and trace ID so you can investigate.
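The quality check described above can be approximated with a Z-score over previous overall scores. This is a minimal sketch, not the project's anomaly detector. The function names and the threshold of 2.0 are assumptions, and the history reuses the six overall scores from the table earlier in this post.

```python
from statistics import mean, stdev


def quality_z_score(score: float, history: list) -> float:
    """Z-score of `score` relative to previous overall scores."""
    if len(history) < 2:
        return 0.0  # not enough data to estimate spread
    sigma = stdev(history)
    if sigma == 0:
        return 0.0  # all previous scores identical; treated as no signal
    return (score - mean(history)) / sigma


def is_quality_anomaly(score: float, history: list, threshold: float = 2.0) -> bool:
    """Flag scores more than `threshold` standard deviations from the mean."""
    return abs(quality_z_score(score, history)) > threshold


history = [3.8, 4.0, 4.2, 3.2, 3.4, 4.2]  # overall scores from the earlier table

print(round(mean(history), 2))            # 3.8
print(is_quality_anomaly(2.0, history))   # True
print(is_quality_anomaly(3.6, history))   # False
```

Returning 0.0 when there is too little history or no variance is a deliberate simplification. It avoids false alarms during warm-up, but it also means the first few runs are never flagged.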
What's Next#
In Blog 4: Token Cost Tracking & Budget Alerts, we'll build the per-request cost calculator: per-model pricing tables multiplied by actual token counts give you exact cost breakdowns by pipeline stage. Blog 5 then adds Z-score anomaly detection that flags requests where latency, cost, or quality deviate more than 2 standard deviations from the running mean.
This project is a fork of the Context Engine, which is a fork of the Deep Research Agent — read those series first for the base pipeline.
All code is open source: github.com/MinhQuanBuiSco/agent-observability