AI Engineering··14 min

Agent Observability Series — Blog 4: Token Cost Tracking & Budget Alerts

One query costs $0.015. Another costs $0.08. Why? Build a cost tracker that breaks down every cent by pipeline stage, with daily budget limits and per-model Bedrock pricing.


One query costs $0.015. Another costs $0.08. Why? Without a cost tracker, you have no idea. You see a monthly Bedrock bill and shrug. But when you break down every token -- input vs. output, stage by stage, model by model -- the picture becomes sharp. The expensive query ran through 4 parallel researchers on Sonnet 4.6. The cheap one hit a memory cache and skipped research entirely. The cost tracker shows exactly where every token -- and every cent -- went.

In this post, we build a per-request cost tracking system with Bedrock pricing tables, per-stage breakdowns, daily budget limits, and a frontend component that visualizes it all.


The Agent Observability Series

| Part | Title | Focus |
|------|-------|-------|
| 1 | Architecture & OpenTelemetry Foundations | System design, gen_ai.* conventions, span model |
| 2 | Tracing Multi-Agent Research Pipelines | Hierarchical traces, TraceCollector, dashboard API |
| 3 | Automated Evaluation with LLM-as-Judge | 5-criteria scoring, quality trends |
| 4 | Token Cost Tracking & Budget Alerts (this post) | Per-model pricing, stage breakdown, budget alerts |
| 5 | Production Dashboard & Anomaly Detection | 4-panel dashboard, Z-score anomaly detection |

The Pricing Table

Cost tracking starts with knowing what each model charges. Amazon Bedrock publishes per-token pricing for each Claude model. We encode these as a Python dictionary:

# Bedrock pricing per 1M tokens (input/output) as of 2026
# https://aws.amazon.com/bedrock/pricing/
MODEL_PRICING = {
    "us.anthropic.claude-haiku-4-5-20251001-v1:0": {
        "input_per_1m": 0.80,
        "output_per_1m": 4.00,
        "name": "Claude Haiku 4.5",
    },
    "us.anthropic.claude-sonnet-4-6-20260514-v1:0": {
        "input_per_1m": 3.00,
        "output_per_1m": 15.00,
        "name": "Claude Sonnet 4.6",
    },
    "us.anthropic.claude-opus-4-6-20260610-v1:0": {
        "input_per_1m": 15.00,
        "output_per_1m": 75.00,
        "name": "Claude Opus 4.6",
    },
}

# Default pricing for unknown models
DEFAULT_PRICING = {"input_per_1m": 1.00, "output_per_1m": 5.00, "name": "Unknown"}

Notice the asymmetry: output tokens cost 5x as much as input tokens across all three models. This matters. On Opus, 2,000 output tokens cost $0.15 in output alone, while 500 output tokens cost $0.0375. The cost tracker captures both sides so you can see exactly where the money goes.

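The DEFAULT_PRICING fallback ensures we never crash on an unknown model ID. Two caveats apply. The defaults sit just above Haiku rates, so an unrecognized Sonnet- or Opus-class model would be undercounted rather than priced conservatively. And the lookup in record_call() falls back silently, so an unknown model ID only gets flagged for investigation if you add a warning yourself.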
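A minimal sketch of a lookup that makes the fallback visible. The helper name get_pricing and the use of the standard logging module are assumptions for illustration; they are not part of the code in this post.

```python
import logging

logger = logging.getLogger("cost_tracker")

# Trimmed copy of the pricing table above (USD per 1M tokens).
MODEL_PRICING = {
    "us.anthropic.claude-haiku-4-5-20251001-v1:0": {
        "input_per_1m": 0.80,
        "output_per_1m": 4.00,
        "name": "Claude Haiku 4.5",
    },
}
DEFAULT_PRICING = {"input_per_1m": 1.00, "output_per_1m": 5.00, "name": "Unknown"}


def get_pricing(model_id: str) -> dict:
    """Hypothetical helper: look up pricing and warn when the fallback is used."""
    pricing = MODEL_PRICING.get(model_id)
    if pricing is None:
        # Surface unknown IDs so a missing table entry does not go unnoticed.
        logger.warning("unknown model id %r, using DEFAULT_PRICING", model_id)
        return DEFAULT_PRICING
    return pricing
```

Calling get_pricing() in place of MODEL_PRICING.get(model_id, DEFAULT_PRICING) keeps the no-crash behavior while leaving a trail in the logs.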


Data Classes: CostEntry and RequestCost

Each LLM call produces a CostEntry. An entire research request aggregates its entries into a RequestCost:

from dataclasses import dataclass, field


@dataclass
class CostEntry:
    """Cost record for a single LLM call."""
    stage: str
    model_id: str
    input_tokens: int = 0
    output_tokens: int = 0
    input_cost: float = 0.0
    output_cost: float = 0.0
    total_cost: float = 0.0
    timestamp: str = ""


@dataclass
class RequestCost:
    """Aggregated cost for an entire research request."""
    entries: list[CostEntry] = field(default_factory=list)
    total_input_tokens: int = 0
    total_output_tokens: int = 0
    total_cost: float = 0.0

    def to_dict(self) -> dict:
        return {
            "entries": [
                {
                    "stage": e.stage,
                    "model": MODEL_PRICING.get(e.model_id, DEFAULT_PRICING)["name"],
                    "input_tokens": e.input_tokens,
                    "output_tokens": e.output_tokens,
                    "cost": round(e.total_cost, 6),
                }
                for e in self.entries
            ],
            "total_input_tokens": self.total_input_tokens,
            "total_output_tokens": self.total_output_tokens,
            "total_cost": round(self.total_cost, 6),
            "cost_by_stage": self._cost_by_stage(),
        }

    def _cost_by_stage(self) -> dict:
        stages: dict[str, float] = {}
        for e in self.entries:
            stages[e.stage] = stages.get(e.stage, 0) + e.total_cost
        return {k: round(v, 6) for k, v in sorted(stages.items(), key=lambda x: x[1], reverse=True)}

CostEntry separates input_cost and output_cost because the pricing is asymmetric. A single Opus call that consumes 10,000 input tokens and generates 2,000 output tokens costs:

  • Input: (10,000 / 1,000,000) * $15.00 = $0.15
  • Output: (2,000 / 1,000,000) * $75.00 = $0.15
  • Total: $0.30

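Because output tokens cost 5x as much as input tokens, input and output costs come out equal when a call uses five input tokens for every output token, as in this example. Research queries usually carry far more input (context, instructions, search results) than output (the model's response), and input costs dominate a stage whenever that ratio exceeds 5:1. Below 5:1, output costs dominate even though input tokens are more numerous.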
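The break-even point is easy to check numerically. The sketch below uses a hypothetical helper, split_cost, with the Opus rates from the pricing table; the token counts are illustrative.

```python
import math


def split_cost(input_tokens: int, output_tokens: int,
               input_per_1m: float, output_per_1m: float) -> tuple:
    """Return (input_cost, output_cost) in USD for one call."""
    input_cost = (input_tokens / 1_000_000) * input_per_1m
    output_cost = (output_tokens / 1_000_000) * output_per_1m
    return input_cost, output_cost


# Opus rates from the pricing table: $15 input, $75 output per 1M tokens.
for inp, out in [(10_000, 2_000), (10_000, 1_000), (6_000, 2_000)]:
    i_cost, o_cost = split_cost(inp, out, 15.00, 75.00)
    if math.isclose(i_cost, o_cost):
        side = "equal"
    elif i_cost > o_cost:
        side = "input dominates"
    else:
        side = "output dominates"
    print(f"{inp:>6} in / {out:>5} out: input ${i_cost:.4f}, output ${o_cost:.4f} ({side})")
```

The 5:1 case is the break-even, the 10:1 case is input-dominated, and the 3:1 case is output-dominated.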

RequestCost.to_dict() serializes everything for the frontend and also computes cost_by_stage -- a dictionary mapping stage names to their total cost, sorted from most to least expensive. This drives the stage breakdown chart in the dashboard.
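To see the aggregation in isolation, here is a standalone sketch of the same logic working on plain (stage, cost) pairs. The function name and the sample costs are illustrative. A stage that is recorded more than once, as parallel research is, collapses into a single total.

```python
import json


def cost_by_stage(entries) -> dict:
    """entries: iterable of (stage, cost) pairs. Returns totals, most expensive first."""
    totals = {}
    for stage, cost in entries:
        totals[stage] = totals.get(stage, 0.0) + cost
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return {stage: round(cost, 6) for stage, cost in ranked}


# Illustrative costs only: two parallel researchers, one synthesis, one critique.
sample = [
    ("parallel_research", 0.004),
    ("synthesis", 0.005),
    ("parallel_research", 0.003),
    ("critique", 0.001),
]
by_stage = cost_by_stage(sample)

# Python dicts keep insertion order, and json.dumps preserves it.
payload = json.dumps(by_stage)
print(payload)
```

The sorted order survives serialization, but the frontend aggregation shown later re-sorts anyway, so nothing depends on it.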


The CostTracker Class

The CostTracker manages per-request tracking, maintains a history of past requests, and enforces a daily budget:

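from datetime import datetime, timezone

import structlog  # assumption: the keyword-argument logger calls below match structlog's API

logger = structlog.get_logger()


class CostTracker: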
    """Tracks token costs per request and maintains history."""

    def __init__(self):
        self._current: RequestCost | None = None
        self._history: list[dict] = []
        self._daily_total: float = 0.0
        self._budget_limit: float = 10.0  # $10/day default

    def start_request(self) -> RequestCost:
        """Start tracking a new request."""
        self._current = RequestCost()
        return self._current

The singleton pattern ensures a single tracker across the application:

_tracker = None

def get_cost_tracker() -> "CostTracker":
    """Get the singleton cost tracker."""
    global _tracker
    if _tracker is None:
        _tracker = CostTracker()
    return _tracker

Every research request follows the same lifecycle: start_request() -> multiple record_call() -> finish_request(). The tracker accumulates entries during the request and commits the total to history when finished.
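One way to make sure finish_request() always runs, even when a stage raises, is to wrap the lifecycle in a context manager. This is a sketch, not part of the tracker in this post: tracked_request is a hypothetical helper, and StubTracker is a stand-in that only mimics the three lifecycle methods so the example runs on its own.

```python
from contextlib import contextmanager


class StubTracker:
    """Stand-in exposing the same three-method lifecycle as CostTracker."""

    def __init__(self):
        self.calls = []
        self.finished = []

    def start_request(self):
        self.calls = []

    def record_call(self, stage, model_id, input_tokens, output_tokens):
        self.calls.append((stage, input_tokens, output_tokens))

    def finish_request(self, query=""):
        result = {"query": query[:80], "calls": len(self.calls)}
        self.finished.append(result)
        return result


@contextmanager
def tracked_request(tracker, query: str):
    """Hypothetical wrapper: start on entry, always finish on exit."""
    tracker.start_request()
    try:
        yield tracker
    finally:
        tracker.finish_request(query=query)


tracker = StubTracker()
with tracked_request(tracker, "What is quantum computing?") as t:
    t.record_call("synthesis", "model-id", 1_000, 200)
```

Note that the singleton holds one _current request at a time, so this pattern fits sequential requests. Concurrent requests sharing one tracker would interleave their entries.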


Recording Calls: The Core Logic

record_call() is where pricing meets token counts:

def record_call(self, stage: str, model_id: str, input_tokens: int, output_tokens: int) -> CostEntry:
    """Record a single LLM call's cost."""
    pricing = MODEL_PRICING.get(model_id, DEFAULT_PRICING)

    entry = CostEntry(
        stage=stage,
        model_id=model_id,
        input_tokens=input_tokens,
        output_tokens=output_tokens,
        input_cost=(input_tokens / 1_000_000) * pricing["input_per_1m"],
        output_cost=(output_tokens / 1_000_000) * pricing["output_per_1m"],
        timestamp=datetime.now(timezone.utc).isoformat(),
    )
    entry.total_cost = entry.input_cost + entry.output_cost

    if self._current:
        self._current.entries.append(entry)
        self._current.total_input_tokens += input_tokens
        self._current.total_output_tokens += output_tokens
        self._current.total_cost += entry.total_cost

    self._daily_total += entry.total_cost

    logger.debug(
        "cost_recorded",
        stage=stage,
        input_tokens=input_tokens,
        output_tokens=output_tokens,
        cost=round(entry.total_cost, 6),
    )

    return entry

The formula is straightforward: (tokens / 1,000,000) * price_per_1m. Bedrock prices are quoted per million tokens, so we divide by one million. The cost is computed separately for input and output because they have different rates.

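Each call also increments _daily_total, which accumulates independently of individual requests. This is the running total used for budget enforcement. As written, nothing resets it at the day boundary, so it is really a process-lifetime total until you add a reset.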
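A truly daily budget needs the total to roll over when the date changes. A minimal sketch, assuming UTC day boundaries; the DailySpend class and its injectable now parameter are hypothetical, added so the rollover can be exercised without waiting for midnight.

```python
from datetime import date, datetime, timezone
from typing import Optional


class DailySpend:
    """Hypothetical running total that resets when the UTC date changes."""

    def __init__(self, budget_limit: float = 10.0):
        self.budget_limit = budget_limit
        self._day: Optional[date] = None
        self._total = 0.0

    def add(self, cost: float, now: Optional[datetime] = None) -> float:
        """Add a cost, resetting first if a new UTC day has started."""
        now = now or datetime.now(timezone.utc)
        if now.date() != self._day:
            self._day = now.date()
            self._total = 0.0
        self._total += cost
        return self._total

    def is_over_budget(self) -> bool:
        return self._total > self.budget_limit


spend = DailySpend(budget_limit=1.0)
late = datetime(2026, 1, 1, 23, 59, tzinfo=timezone.utc)
next_day = datetime(2026, 1, 2, 0, 1, tzinfo=timezone.utc)

spend.add(0.75, now=late)
spend.add(0.5, now=late)
over_before_midnight = spend.is_over_budget()

total_after_midnight = spend.add(0.25, now=next_day)
```

The same date check could live inside record_call() before the increment. Resetting lazily on the next call avoids needing a scheduler.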

Finishing a Request

When the research pipeline completes, finish_request() serializes the current request and pushes it to history:

def finish_request(self, query: str = "") -> dict:
    """Finish tracking the current request and save to history."""
    if not self._current:
        return {}

    result = self._current.to_dict()
    result["query"] = query[:80]
    result["timestamp"] = datetime.now(timezone.utc).isoformat()

    self._history.insert(0, result)
    if len(self._history) > 200:
        self._history = self._history[:200]

    self._current = None
    return result

The history caps at 200 entries to prevent unbounded memory growth. The query is truncated to 80 characters -- enough context for the "Most Expensive" list without storing full user queries in memory.
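The insert-then-slice pattern works, but the standard library's collections.deque with maxlen expresses the same cap in one line. A sketch of that alternative; the push helper and the sample entries are illustrative.

```python
from collections import deque

MAX_HISTORY = 200

# A bounded deque discards from the opposite end once it is full.
history = deque(maxlen=MAX_HISTORY)


def push(result: dict) -> None:
    """Newest entry goes to the front; the oldest falls off the back when full."""
    history.appendleft(result)


for i in range(250):
    push({"query": f"q{i}", "total_cost": 0.01})

print(len(history), history[0]["query"], history[-1]["query"])
```

A deque is not JSON-serializable as-is, so convert with list(history) before returning it from an API handler.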


Integration: Tracking Cost per Pipeline Stage

In app.py, the cost tracker wraps each pipeline stage's token usage. The context engine already tracks token counts per stage, so we read them directly:

from observability.cost_tracker import get_cost_tracker

# After research completes
cost_tracker = get_cost_tracker()
cost_tracker.start_request()
for stage in context_engine.stages:
    if stage.tokens > 0:
        # Estimate input/output split (roughly 75% input, 25% output)
        input_t = int(stage.tokens * 0.75)
        output_t = stage.tokens - input_t
        cost_tracker.record_call(stage.name, settings.bedrock_model_id, input_t, output_t)
request_cost = cost_tracker.finish_request(query=message)

Each context engine stage -- memory_lookup, query_decomposition, tool_selection, parallel_research, deduplication, critique, synthesis -- becomes a cost entry. The 75/25 input/output split is an approximation: research queries are overwhelmingly input-heavy (prompts + search results), with relatively short model outputs (decomposed queries, critiques, synthesized answers).

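For a production system, you would instrument each Bedrock call directly and capture exact input/output token counts from the API response. Our approach trades precision for simplicity. Because output tokens cost 5x as much as input tokens, the dollar figures are sensitive to the assumed split, so treat them as estimates that are good for comparing stages and tracking budget trends rather than for reconciling against the bill.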
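Exact counts are available from Bedrock itself: the Converse API response includes a usage object with inputTokens, outputTokens, and totalTokens. The sketch below parses that shape and compares the exact cost with the 75/25 estimate. The helper names and the sample response are illustrative, and no Bedrock call is made.

```python
def tokens_from_converse(response: dict) -> tuple:
    """Extract exact (input, output) token counts from a Converse response dict."""
    usage = response.get("usage", {})
    return int(usage.get("inputTokens", 0)), int(usage.get("outputTokens", 0))


def haiku_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost in USD at the Haiku rates from the pricing table ($0.80 / $4.00 per 1M)."""
    return (input_tokens / 1_000_000) * 0.80 + (output_tokens / 1_000_000) * 4.00


# Illustrative response fragment: an input-heavy call with a 90/10 split.
sample_response = {
    "usage": {"inputTokens": 9_000, "outputTokens": 1_000, "totalTokens": 10_000}
}

exact_in, exact_out = tokens_from_converse(sample_response)
exact = haiku_cost(exact_in, exact_out)

# The 75/25 estimate applied to the same 10,000 total tokens.
est_in = int(10_000 * 0.75)
estimated = haiku_cost(est_in, 10_000 - est_in)

print(f"exact ${exact:.4f} vs estimated ${estimated:.4f}")
```

For this sample the estimate overshoots, because it assumes more of the expensive output tokens than the call actually produced. Passing exact_in and exact_out to record_call() removes the guesswork.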


The cost_comparison WebSocket Event

After computing cost, the server sends a cost_comparison event to the frontend:

await send_event({
    "type": "cost_comparison",
    "engineered_tokens": engineered_tokens,
    "naive_tokens": naive_tokens,
    "savings_percent": round((1 - engineered_tokens / max(naive_tokens, 1)) * 100, 1),
    "total_cost": request_cost.get("total_cost", 0),
})

This event carries both the token savings from context engineering and the dollar cost of the request. The frontend uses this to show a combined view: "You used 12,847 tokens (saved 34.5%) and spent $0.031."
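The max(naive_tokens, 1) guard prevents a division by zero, but it produces a misleading figure when the naive count is missing: 500 engineered tokens against a naive count of 0 reports -49900.0%. A small sketch of a stricter helper, assuming that "no baseline" should read as 0% savings. That choice is ours for illustration, not the post's.

```python
def savings_percent(engineered_tokens: int, naive_tokens: int) -> float:
    """Percent of tokens saved versus the naive baseline, rounded to 1 decimal."""
    if naive_tokens <= 0:
        # Assumption: without a baseline there is nothing to compare against.
        return 0.0
    return round((1 - engineered_tokens / naive_tokens) * 100, 1)


print(savings_percent(50, 100))
print(savings_percent(500, 0))
print(savings_percent(120, 100))
```

A negative result is still possible and meaningful: it means the engineered context used more tokens than the naive one.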


Budget Management

The CostTracker includes simple but effective budget controls:

def is_over_budget(self) -> bool:
    """Check if daily spend exceeds budget."""
    return self._daily_total > self._budget_limit

def get_summary(self) -> dict:
    """Get cost summary statistics."""
    # Always return every key, so the dashboard never divides by a missing
    # budget_limit and daily_total reflects calls from unfinished requests.
    summary = {
        "total_requests": len(self._history),
        "daily_total": round(self._daily_total, 4),
        "budget_limit": self._budget_limit,
        "budget_remaining": round(self._budget_limit - self._daily_total, 4),
        "avg_cost": 0.0,
        "max_cost": 0.0,
        "min_cost": 0.0,
    }
    if self._history:
        costs = [h["total_cost"] for h in self._history]
        summary.update(
            avg_cost=round(sum(costs) / len(costs), 6),
            max_cost=round(max(costs), 6),
            min_cost=round(min(costs), 6),
        )
    return summary

The default budget is $10/day. At the per-query costs measured later in this post ($0.0147 to $0.0789 on Haiku), that buys roughly 125-680 research queries depending on complexity. is_over_budget() is meant to be checked before accepting new queries -- if the daily spend exceeds the limit, the agent can refuse the request or switch to a cheaper model.

The summary statistics (avg_cost, max_cost, min_cost) help you understand cost distribution. If max_cost is 10x your avg_cost, you have an outlier problem -- certain queries trigger expensive code paths that deserve investigation.
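Both ideas, the budget gate and the outlier check, fit in a few lines. This is a sketch under stated assumptions: pick_model and has_outlier are hypothetical helpers, the 80% soft threshold is borrowed from the dashboard's red zone rather than from the tracker, and the choice of Sonnet as the preferred model is illustrative.

```python
HAIKU = "us.anthropic.claude-haiku-4-5-20251001-v1:0"
SONNET = "us.anthropic.claude-sonnet-4-6-20260514-v1:0"


def pick_model(daily_total: float, budget_limit: float,
               preferred: str = SONNET, fallback: str = HAIKU,
               soft_ratio: float = 0.8):
    """Return a model ID, or None when the request should be refused."""
    if daily_total > budget_limit:
        return None  # over budget: refuse
    if daily_total > budget_limit * soft_ratio:
        return fallback  # close to the limit: downgrade to the cheaper model
    return preferred


def has_outlier(avg_cost: float, max_cost: float, factor: float = 10.0) -> bool:
    """True when the most expensive request is at least `factor` times the average."""
    return avg_cost > 0 and max_cost >= factor * avg_cost


print(pick_model(2.0, 10.0) == SONNET)
print(pick_model(9.0, 10.0) == HAIKU)
print(pick_model(10.5, 10.0))
```

In the request handler, a None result maps to a refusal message, and the summary's avg_cost and max_cost can be fed to has_outlier() on each refresh.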


Dashboard: The CostBreakdown Component

The frontend CostBreakdown component renders three things: a budget progress bar, a stage breakdown chart, and a list of the most expensive queries.

Budget Bar

The budget bar changes color based on spend level -- green at 50% or below, amber above 50% up to 80%, red above 80%:

const budgetPercent = useMemo(() => {
  if (!data?.summary) return 0;
  return (data.summary.daily_total / data.summary.budget_limit) * 100;
}, [data]);

const budgetColor = useMemo(() => {
  if (budgetPercent > 80) return 'from-red-500/80 to-red-500/60';
  if (budgetPercent > 50) return 'from-amber-500/80 to-amber-500/60';
  return 'from-green-500/80 to-green-500/60';
}, [budgetPercent]);

The bar itself is an animated motion.div that fills from left to right:

<div className="h-2 w-full rounded-full bg-secondary overflow-hidden">
  <motion.div
    initial={{ width: 0 }}
    animate={{ width: `${Math.min(budgetPercent, 100)}%` }}
    transition={{ duration: 0.6 }}
    className={`h-full rounded-full bg-gradient-to-r ${budgetColor}`}
  />
</div>
<div className="mt-1.5 flex justify-between text-[10px] font-mono text-muted-foreground">
  <span>avg {formatCost(data.summary.avg_cost)} / req</span>
  <span>{formatCost(data.summary.budget_remaining)} remaining</span>
</div>

At a glance, you know whether your daily spend is healthy. The "avg per req" and "remaining" labels provide quick mental math -- if the average is $0.03 and you have $8.50 remaining, you have roughly 280 queries left.
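The same mental math can be computed on the server and shipped with the summary. A small sketch; queries_left is a hypothetical helper, and clamping to zero when the budget is exhausted is our choice.

```python
import math


def queries_left(budget_remaining: float, avg_cost: float) -> int:
    """Estimate how many average-cost queries still fit in today's budget."""
    if budget_remaining <= 0 or avg_cost <= 0:
        return 0
    return math.floor(budget_remaining / avg_cost)


print(queries_left(8.50, 0.03))
```

The estimate assumes future queries cost about as much as the running average, so it drifts when the query mix changes.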

Stage Breakdown

The stage breakdown is a stacked horizontal bar showing cost proportion by pipeline stage:

const stageBreakdown = useMemo(() => {
  if (!data?.history || data.history.length === 0) return [];
  const totals: Record<string, number> = {};
  for (const entry of data.history) {
    for (const [stage, cost] of Object.entries(entry.cost_by_stage)) {
      totals[stage] = (totals[stage] || 0) + cost;
    }
  }
  const grand = Object.values(totals).reduce((a, b) => a + b, 0);
  return Object.entries(totals)
    .sort(([, a], [, b]) => b - a)
    .map(([stage, cost]) => ({
      stage,
      cost,
      percent: grand > 0 ? (cost / grand) * 100 : 0,
    }));
}, [data]);

Each stage gets a color: blue for research, primary for synthesis, amber for critique, purple for evaluation. The stacked bar animates with staggered delays -- each segment grows from zero width with a 100ms offset, so the bar fills in left to right.

Below the bar, a legend maps colors to stage names with their dollar costs.

Top Expensive Queries

The bottom section shows the 5 most expensive queries, sorted by total cost:

const sortedHistory = useMemo(() => {
  if (!data?.history) return [];
  return [...data.history]
    .sort((a, b) => b.total_cost - a.total_cost)
    .slice(0, 5);
}, [data]);

Each entry shows the cost, truncated query text, and timestamp. This is the quickest way to spot outliers -- if one query costs $0.08 while the rest cost $0.02, you click through to investigate why.
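The same selection can be done on the server, for example to back a top-N endpoint, using the standard library's heapq.nlargest. The top_expensive helper is hypothetical, and the sample history holds illustrative entries in the same shape that finish_request() returns.

```python
import heapq


def top_expensive(history: list, n: int = 5) -> list:
    """Return the n most expensive requests, most expensive first."""
    return heapq.nlargest(n, history, key=lambda h: h["total_cost"])


# Illustrative history entries.
history = [
    {"query": "What is quantum computing?", "total_cost": 0.0147},
    {"query": "Compare React, Vue, and Angular", "total_cost": 0.0312},
    {"query": "Explain the history of AI from 1950 to 2025", "total_cost": 0.0521},
    {"query": "How does TCP/IP work?", "total_cost": 0.0183},
    {"query": "Compare cloud providers AWS vs Azure vs GCP", "total_cost": 0.0789},
    {"query": "What is DNS?", "total_cost": 0.0101},
]

for item in top_expensive(history, n=2):
    print(f"${item['total_cost']:.4f}  {item['query']}")
```

With history capped at 200 entries, a full sort is just as fast in practice. nlargest mainly states the intent more directly.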


Real Numbers from Our Runs

Here is a cost breakdown from actual test queries running Claude Haiku 4.5 on Bedrock:

| Query | Tokens | Cost | Most Expensive Stage |
|-------|--------|------|----------------------|
| "What is quantum computing?" | 4,200 | $0.0147 | parallel_research ($0.0089) |
| "Compare React, Vue, and Angular" | 12,800 | $0.0312 | parallel_research ($0.0178) |
| "Explain the history of AI from 1950 to 2025" | 18,400 | $0.0521 | synthesis ($0.0214) |
| "How does TCP/IP work?" | 6,100 | $0.0183 | parallel_research ($0.0112) |
| "Compare cloud providers AWS vs Azure vs GCP for ML workloads" | 22,300 | $0.0789 | parallel_research ($0.0398) |

Patterns that emerge:

  1. Parallel research dominates cost. It is the most expensive stage in four of the five queries, at roughly 50-61% of the total, because it dispatches multiple sub-queries, each involving Bedrock calls and search results.

  2. Complex comparison queries cost roughly 2-5x as much as simple factual queries. More sub-queries = more parallel researchers = more tokens = more cost.

  3. Synthesis costs scale with input size. Broader queries produce more findings, which means a larger synthesis prompt. The history-of-AI query (18,400 tokens in total) is the only one where synthesis, at $0.0214, was the most expensive stage.

  4. Memory cache hits dramatically reduce cost. A repeated query that hits the semantic memory cache skips parallel research entirely, dropping cost by 60-70%.

At Haiku 4.5 pricing ($0.80/$4.00 per 1M tokens), these costs are modest. But switch to Sonnet 4.6 (3.75x the price) or Opus 4.6 (18.75x the price), and the same queries cost proportionally more. The cost tracker makes this visible so you can make informed model choices per stage.
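To see the multipliers on a concrete token count, the sketch below prices one hypothetical 10,000-token request (split 75/25, as in the integration code) on all three models. The rates come from the pricing table. The project helper and the token count are illustrative, not measured.

```python
# (input, output) USD per 1M tokens, from the pricing table.
RATES = {
    "Claude Haiku 4.5": (0.80, 4.00),
    "Claude Sonnet 4.6": (3.00, 15.00),
    "Claude Opus 4.6": (15.00, 75.00),
}


def project(input_tokens: int, output_tokens: int) -> dict:
    """Price the same token counts on every model in RATES."""
    return {
        name: (input_tokens / 1_000_000) * in_rate + (output_tokens / 1_000_000) * out_rate
        for name, (in_rate, out_rate) in RATES.items()
    }


costs = project(7_500, 2_500)
baseline = costs["Claude Haiku 4.5"]
for name, cost in costs.items():
    print(f"{name}: ${cost:.4f} ({cost / baseline:.2f}x Haiku)")
```

Because all three models share the same 5:1 output-to-input price ratio, the multiplier is the same whatever the input/output split. That is what makes per-stage model selection easy to reason about: moving only synthesis to Sonnet multiplies only the synthesis line item.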


What Comes Next

Cost tracking tells you how much you are spending. But it does not tell you when something is wrong. A query that usually costs $0.02 suddenly costs $0.08 -- is that a real anomaly or normal variance? In Blog 5: Production Dashboard & Anomaly Detection, we build a Z-score anomaly detector that flags cost and latency spikes automatically, and a 4-panel React dashboard that shows your agent's health at a glance.


All code is open source: github.com/MinhQuanBuiSco/agent-observability