Aurora Market Series — Blog 3: The Router That Wouldn't Route + the Nemotron <think> Trap

The router returned valid JSON for about two-thirds of inputs, and the composer's reply sometimes stopped after 23 characters. The fragment turned out to be Nemotron Nano's reasoning mode silently eating my max_tokens budget. Here's how I tracked down the bug, the chat_template_kwargs fix, and the deterministic keyword backstop I added so the next model swap doesn't break the router again.

In Blog 2 I split the agent layer into four specialists and admitted the hard part was deciding which to run. This post is that hard part — and the bug that ate two evenings before I understood what was happening.

The job is small enough to fit in a paragraph. A user message comes in. A router LLM picks one to three specialists (search, recommend, promotion, post_purchase) and returns its choice as JSON. The orchestrator dispatches to each chosen agent, collects their outputs, and a composer LLM fuses them into one warm reply for the shopper.

In practice, two things broke.

The first was the router. Nemotron Nano returned valid JSON about two-thirds of the time. The other third, it returned fenced JSON, JSON wrapped in prose, JSON with leading whitespace, or — once — a polite refusal to choose. My json.loads would fail and the orchestrator would fall through to ["search"], which made the system silently single-agent for every request that needed multi-agent routing.

The second was worse. The composer reply, when it did stream, was sometimes a 23-character fragment: "I found some noise-canc" and then stop. The agent did the work. The products were correct. The reply just trailed off mid-word. I assumed max_tokens was too low and raised it from 350 to 800. The fragment got longer by exactly two tokens. Something else was eating the budget.

[Figure: Multi-agent chat reply with a promotion banner above the input (FRESHHOME15, 15% off all home goods, with an Apply button)]


The Aurora Market Series

| Part | Title | Focus |
|---|---|---|
| 1 | Architecture & The Agentic Commerce Bet | Four specialists, NIM as inference, ACP-style checkout |
| 2 | Four Specialists, One Tool-Calling Loop | The base loop, per-agent prompts, cart context as a tool |
| 3 | The Router That Wouldn't Route + the Nemotron <think> Trap (this post) | LLM router + keyword backstop, reasoning-mode reply truncation, chat_template_kwargs fix |
| 4 | Realtime: SSE, Live Agent Chips, and Token Streams | SSE-over-POST events, in-flight pills, React reducer pattern |
| 5 | Generating the Catalog: Picsum → LoremFlickr → FLUX.1-schnell | Three iterations of thumbnail accuracy |
| 6 | Editorial Aesthetic for an AI Storefront | Fraunces + Geist, clay + sage, agent chips as transparency |

The Router's Job

The router is the simplest LLM call in the whole system. One system prompt, no tools, JSON output, temperature=0.0. The prompt is exactly:

You are the routing brain of an agentic retail storefront. Choose
which specialist agents should run for this turn. Return STRICT JSON only, with
this shape:

{"agents": ["search"|"recommend"|"promotion"|"post_purchase", ...], "rationale": "..."}

Guidance:
- search: shopper is browsing or looking for something new ("show me", "I need", "do you have").
- recommend: cart-aware suggestions, complementary items, "what else", "anything to go with".
- promotion: discounts, codes, deals, "any promo", "make it cheaper".
- post_purchase: existing-order questions, shipping, tracking, returns.
Pick the minimum useful set. Usually one or two agents. Never more than three.

For obvious cases this works. "Show me ANC headphones" → {"agents": ["search"]}. "What's the tracking on my order?" → {"agents": ["post_purchase"]}.

For composite cases it gets shakier. "What else should I get to go with these for a long flight? Any promo I can stack?" should route to ["recommend", "promotion"]. About half the time, Nemotron Nano returns that. About a quarter, it returns ["search"] only, because "what else" is similar enough to "show me" that the model overweights the search heuristic. The remaining quarter, it returns JSON-ish output that my parser couldn't read.

There are three solvable problems and one harder one underneath. I'll start with the parser.


Problem 1: JSON That Isn't Quite JSON

The model returns five distinct shapes:

  1. Pure JSON — what I asked for. json.loads works.
  2. Markdown-fenced JSON: ```json\n{...}\n```. Common when the model is being "helpful."
  3. JSON with leading whitespace or a "Here you go:" preamble. Less common but real.
  4. JSON inside a longer explanation — the model paraphrases the rationale into prose around the object.
  5. No JSON, just prose. Rare, but it happens.

The defensive parser strips fences, slices from the first { to the last }, and parses that. The whole router function is under forty lines:

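import json
import logging

log = logging.getLogger(__name__)  # assumed; the project may set up its logger elsewhere

# ChatMessage, ROUTER_SYSTEM, VALID_AGENTS, and chat_completion are defined elsewhere in the project.
def _route(user_message: str, history: list[ChatMessage]) -> list[str]: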
    hist = [{"role": m.role, "content": m.content} for m in history[-6:]]
    messages = [
        {"role": "system", "content": ROUTER_SYSTEM},
        *hist,
        {"role": "user", "content": user_message},
    ]
    kw = _keyword_route(user_message)
    try:
        resp = chat_completion(messages, temperature=0.0, max_tokens=200)
        text = (resp.choices[0].message.content or "").strip()

        # Strip markdown fences
        if text.startswith("```"):
            text = text.strip("`")
            if text.lower().startswith("json"):
                text = text[4:].lstrip()

        # Slice from the first { to the last } to drop any surrounding prose
        if not text.startswith("{"):
            start = text.find("{")
            end = text.rfind("}")
            if start != -1 and end != -1 and end > start:
                text = text[start : end + 1]

        parsed = json.loads(text)
        agents = [a for a in parsed.get("agents", []) if a in VALID_AGENTS]
        if not agents:
            return kw

        # Union with keyword hits so we don't miss obvious intents
        for a in kw:
            if a not in agents:
                agents.append(a)
        return agents
    except Exception as e:
        log.warning("router fallback (%s) → keyword route %s", e, kw)
        return kw

This caught shapes 1–4. Shape 5 — prose with no JSON anywhere — still has to fall through to a fallback. That fallback is the keyword router.


Problem 2: The Keyword Backstop

The keyword router is deliberately not clever. It's under twenty lines of if any(k in m for k in (...)):

def _keyword_route(msg: str) -> list[str]:
    m = msg.lower()
    chosen: list[str] = []
    if any(k in m for k in ("track", "tracking", "order status", "where is my", "shipping", "delivery", "return", "refund")):
        chosen.append("post_purchase")
    if any(k in m for k in ("promo", "discount", "deal", "sale", "coupon", "cheaper", "save")):
        chosen.append("promotion")
    if any(k in m for k in ("what else", "go with", "complement", "pair", "recommend", "suggest", "anything to add", "stack")):
        chosen.append("recommend")
    if not chosen or chosen in (["recommend"], ["promotion"], ["recommend", "promotion"]):
        if any(k in m for k in ("show", "find", "i need", "i want", "i'm looking", "do you have", "looking for", "search")):
            chosen.insert(0, "search")
    if not chosen:
        chosen = ["search"]
    seen: set[str] = set()
    return [a for a in chosen if not (a in seen or seen.add(a))]

Two design choices here. First, it always returns something — at minimum ["search"]. That matters because the LLM router's failure mode is most often "I gave you something unparseable"; if the fallback is a coin flip, the demo's worst-case quality drops sharply. A deterministic keyword route is a worse-but-predictable baseline.

Second — and this is the small idea I keep going back to in different projects — the LLM result and the keyword result are unioned, not chosen between:

# Union with keyword hits so we don't miss obvious intents
for a in kw:
    if a not in agents:
        agents.append(a)

If the model says ["search"] and the keyword router says ["promotion", "recommend"], we run all three. The cost of running an extra specialist is one more NIM call (~1–2 seconds, streaming, billed in cents); the cost of missing an intent is the shopper not seeing a promo they qualified for. The unioning is asymmetric on purpose.

This pattern shows up in three places in the project: in the router, in the search agent's "refine once and retry" loop, and in the promotion agent's "extract promo code from reply" heuristic. Two methods that mostly agree, run both, take the union. It's the cheapest reliability gain in the entire codebase.


Problem 3: The 23-Character Fragment

This is the one that took the longest. Let me reproduce the bug.

- User: "I need noise-cancelling headphones for travel under $300."
- Router: ["search"]. Correct.
- Search agent: runs, returns six products with a 200-character text summary. Tool call telemetry shows the embedding query and the Milvus result. Correct.
- Composer: receives the agent's output, is called with max_tokens=350, and returns:

"I found some noise-canc"

23 characters. The composer is supposed to write a 3–6 sentence reply for the shopper. Instead, it's returning a quarter of a sentence.

I bumped max_tokens to 800. New output: "I found some noise-cancel", which is 25 characters. Two tokens. The visible text is far shorter than the budget allows, so the reply isn't simply being cut off by a budget that's too small for the answer. Something else is consuming the budget before the answer starts.

I asked the API for usage. The completion token count was around 340. The visible text was 6 tokens. Somewhere, 330 tokens were being generated and discarded.
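That arithmetic can be turned into a quick check. The sketch below is an assumption-laden illustration, not the project's code: the helper names are hypothetical, the 4-characters-per-token ratio is a rough heuristic, and it assumes the endpoint reports finish_reason as "length" when the max_tokens limit is hit, as the OpenAI chat completions format does.

```python
import math


def hidden_token_estimate(
    completion_tokens: int, visible_text: str, chars_per_token: float = 4.0
) -> int:
    """Estimate how many billed completion tokens never reached message.content."""
    visible_tokens = math.ceil(len(visible_text) / chars_per_token)
    return max(completion_tokens - visible_tokens, 0)


def looks_budget_starved(
    finish_reason: str,
    completion_tokens: int,
    visible_text: str,
    threshold: float = 0.8,
) -> bool:
    """True when generation hit the limit and most of the tokens were invisible."""
    if completion_tokens <= 0:
        return False
    hidden = hidden_token_estimate(completion_tokens, visible_text)
    return finish_reason == "length" and hidden / completion_tokens >= threshold
```

Fed the numbers from this bug (340 completion tokens, the 23-character fragment), the estimate comes out at 334 hidden tokens, which is the signal that something other than the reply is spending the budget.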

The answer is in Nemotron Nano 9B v2's chat template. Recent Nemotron variants ship with reasoning mode — they generate a <think>...</think> block before the answer, the chat template strips it, and only the post-think portion reaches your message.content. When the reasoning block is long enough, it eats your entire max_tokens budget. The model "answers," but the answer is "I found some noise-canc" because the budget ran out mid-sentence.

I confirmed by setting max_tokens=4096. The fragment grew to a full reply. The reasoning block had been ~1500 tokens.

That's a viable workaround for the composer, but expensive — every reply now pays for a long invisible reasoning trace. The right fix is to turn reasoning mode off for tasks that don't need it.


The chat_template_kwargs Fix

Nemotron Nano 9B v2 exposes a chat-template-level toggle for reasoning. On the NIM endpoint, you pass it via OpenAI's extra_body:

resp = client.chat.completions.create(
    model="nvidia/nvidia-nemotron-nano-9b-v2",
    messages=messages,
    temperature=temperature,
    max_tokens=max_tokens,
    extra_body={"chat_template_kwargs": {"thinking": False}},
)

When the endpoint understands this, it bypasses the <think> step entirely. When it doesn't, it ignores the key. Either way the request is safe.
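If some calls should keep reasoning and others should not, the toggle can be decided per task before the request is sent. The following is a minimal sketch under assumptions: build_request_kwargs and the REASONING_TASKS set (including the "planner" task name) are hypothetical and not part of the project; only the extra_body payload is taken from the call above.

```python
from typing import Any

# Hypothetical: tasks that should keep reasoning mode on. Everything else turns it off.
REASONING_TASKS = {"planner"}


def build_request_kwargs(
    task: str,
    messages: list[dict[str, str]],
    *,
    temperature: float = 0.0,
    max_tokens: int = 350,
) -> dict[str, Any]:
    """Build keyword arguments for client.chat.completions.create(**kwargs)."""
    kwargs: dict[str, Any] = {
        "model": "nvidia/nvidia-nemotron-nano-9b-v2",
        "messages": messages,
        "temperature": temperature,
        "max_tokens": max_tokens,
    }
    if task not in REASONING_TASKS:
        # Same toggle as in the call above, applied only where reasoning isn't wanted
        kwargs["extra_body"] = {"chat_template_kwargs": {"thinking": False}}
    return kwargs
```

Keeping the decision in one place means the router and composer can't drift apart on whether they pay for a reasoning trace.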

I also added a defensive <think>...</think> stripper to the client, because some Nemotron variants emit stray reasoning blocks even with the template toggle off:

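import re

_THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL | re.IGNORECASE)
_OPEN_RE = re.compile(r"<think>", re.IGNORECASE)
_CLOSE_RE = re.compile(r"</think>", re.IGNORECASE)

def _strip_thought(text: str | None) -> str:
    if not text:
        return ""
    # Drop complete <think>...</think> blocks
    cleaned = _THINK_RE.sub("", text)
    # Orphan </think> (opening tag consumed upstream): keep only what follows it
    cleaned = _CLOSE_RE.split(cleaned)[-1]
    # Unclosed <think> (reasoning cut off): drop the tag and everything after it
    cleaned = _OPEN_RE.split(cleaned, maxsplit=1)[0]
    return cleaned.strip()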

The streaming path needs the same treatment chunk-by-chunk, with an in_think flag that swallows tokens between <think> and </think> delimiters. Blog 4 has that code.
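For readers who want the shape of it now, here is a rough sketch written for this post rather than taken from the project. The function names are hypothetical, and unlike the regex above the matching is case-sensitive. The one subtlety is that a tag can be split across chunks, so the filter holds back any tail that could still turn into a tag.

```python
from collections.abc import Iterable, Iterator

OPEN, CLOSE = "<think>", "</think>"


def _partial_suffix(text: str, tag: str) -> int:
    """Length of the longest proper prefix of tag that text ends with."""
    for n in range(min(len(tag) - 1, len(text)), 0, -1):
        if text.endswith(tag[:n]):
            return n
    return 0


def strip_think_stream(chunks: Iterable[str]) -> Iterator[str]:
    """Yield streamed text with <think>...</think> spans removed."""
    buf = ""
    in_think = False
    for chunk in chunks:
        buf += chunk
        while True:
            tag = CLOSE if in_think else OPEN
            pos = buf.find(tag)
            if pos != -1:
                # Emit text before an opening tag; swallow text before a closing tag
                if not in_think and pos > 0:
                    yield buf[:pos]
                buf = buf[pos + len(tag):]
                in_think = not in_think
                continue
            # No complete tag: hold back only a tail that might be the start of one
            keep = _partial_suffix(buf, tag)
            if not in_think and len(buf) > keep:
                yield buf[: len(buf) - keep]
            buf = buf[len(buf) - keep:]
            break
    # Flush leftover visible text; an unclosed reasoning block is dropped
    if buf and not in_think:
        yield buf
```

Joining the output of strip_think_stream(["<th", "ink>plan", "ning</th", "ink>Hel", "lo"]) gives "Hello", even though both tags arrive split across chunk boundaries.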

After both fixes (chat_template_kwargs and the post-template stripper), the composer reliably returns the full 3–6 sentence reply within an 800-token budget. The pipeline cost per turn dropped because we're no longer paying for invisible reasoning trace tokens.

The lesson generalizes: reasoning-mode models are different products from the non-reasoning version of the same model, even when the name is the same. Test max_tokens behavior explicitly before assuming the OpenAI-compatible interface means OpenAI-compatible behavior.


What the Router Looks Like Now

After all this, the router pipeline is:

user_message
    │
    ▼
keyword_route(msg)   ───────► kw = ["promotion", "recommend"]
    │
    ▼
chat_completion(ROUTER_SYSTEM, msg)    (Nemotron, temp=0, no thinking, JSON-shape prompt)
    │
    ▼
strip fences → pluck {...} → json.loads
    │
    ├── success: agents = parsed["agents"] ∩ VALID_AGENTS, then union with kw → return
    │
    └── exception: log warning, return kw

It's the most defensive piece of code in the project and worth every line. Misrouting never surfaces as an error the user can see: the wrong specialists run, the composer makes do with thin agent text, and the reply is worse without anyone knowing why. Two-pass defense (LLM + keyword) for a four-way classification problem is overkill until the day it isn't.


A Note on JSON Mode and Function Calling

NIM exposes function calling. NIM also supports an OpenAI-style response_format={"type": "json_object"} mode for some models. Why didn't I use either?

Function calling for routing is overhead. The router has no work to do other than return a label set. Wrapping that in a function-call schema means the model has to "decide to call" a single function whose argument shape is the schema you want — extra tokens, extra reasoning steps, no quality benefit over a clean JSON-output prompt. Function calling is the right primitive when the model is choosing among tools that do different things. For pure structured output, plain prompting plus defensive parsing is cheaper and more portable across models.

JSON mode (where it's available) is a fine alternative, but it doesn't help with the keyword union step, which is the part that actually improves reliability. The bottleneck isn't "model returns invalid JSON sometimes"; it's "model picks the wrong agents sometimes." That's an intent problem, not a syntax problem. Defensive parsing fixes the syntax case. Keyword union fixes the intent case. The two layers do different work.


What's Next

The orchestrator now routes well, the composer no longer returns fragments, and we have full agent telemetry on the way back. The next post is about getting all of that onto the user's screen as it happens — Server-Sent Events over POST, the seven event types the orchestrator emits, a React reducer that mutates an assistant turn in place, and the difference between "spinner for 30 seconds" and "watch the agents work."