Context Engine Series -- Blog 4: Dynamic Tool Selection
Every tool you give an AI agent costs tokens. Not just when the agent uses the tool -- the @tool docstring, parameter descriptions, and return type all get injected into the system prompt on every single invocation. With 30 tools, that is roughly 2,400 tokens of context consumed before the model even reads the user's question.
Most queries need one or two tools. "Explain what a transformer is" needs zero. "What happened in the stock market today" needs a search tool. Yet the naive approach packs every tool description into the prompt regardless, wasting context window space that could hold actual research findings.
In this post, we build a dynamic tool selector that classifies query intent, maps it to a minimal tool set, and slashes tool-description tokens by up to 100% for conceptual queries.
The Context Engine Series
| Part | Title | Focus |
|---|---|---|
| 1 | Architecture & Vision | System design, 6 techniques, pipeline overview |
| 2 | Query Complexity & Result Compression | Rule-based classifier, key sentence extraction |
| 3 | Semantic Memory & Research Caching | Jaccard similarity, local JSON cache |
| 4 | Dynamic Tool Selection (this post) | Intent classification, tool loadout |
| 5 | Source & Findings Deduplication | URL dedup, paragraph-level content hashing |
| 6 | Context X-ray Visualization | Real-time token tracking, WebSocket events |
The Problem: Tool Descriptions Are Expensive
When you define a tool with Strands SDK's @tool decorator, the entire docstring becomes part of the agent's system prompt. Here is what the model actually sees for our two research tools:
```python
from strands import tool


@tool
def capturing_tavily_search(query: str, max_results: int = 5) -> dict:
    """Search the web for information on a given query using Tavily API.

    Args:
        query: The search query string.
        max_results: Maximum number of results to return (default 5).

    Returns:
        A dict with 'results' containing title, url, content, and score for each result.
    """
    ...


@tool
def capturing_score_credibility(url: str, domain: str, content_snippet: str) -> dict:
    """Score the credibility of a source based on its domain and content.

    Args:
        url: The full URL of the source.
        domain: The domain name (e.g., 'reuters.com').
        content_snippet: A snippet of the source content for analysis.

    Returns:
        A dict with credibility score (0-100), tier, and reasoning.
    """
    ...
```
We measured the token cost of each description using tiktoken:
| Tool | Tokens |
|---|---|
| capturing_tavily_search | ~120 |
| capturing_score_credibility | ~100 |
| Total (both tools) | ~220 |
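A measurement like this can be sketched as follows. This is a minimal sketch, not the project's actual measurement script: the `cl100k_base` encoding is an assumption (the post does not say which encoding was used), `count_tokens` is a hypothetical helper, and the character-based fallback is a rough heuristic included only so the snippet still runs where tiktoken or its encoding files are unavailable.

```python
def count_tokens(text: str) -> int:
    """Count tokens with tiktoken, or approximate if it is unavailable."""
    try:
        import tiktoken  # third-party: pip install tiktoken

        # Assumption: cl100k_base; use the encoding that matches your model.
        encoding = tiktoken.get_encoding("cl100k_base")
        return len(encoding.encode(text))
    except Exception:
        # Rough heuristic (about 4 characters per token), not a measurement.
        return max(1, len(text) // 4)


SEARCH_DOC = """Search the web for information on a given query using Tavily API.

Args:
    query: The search query string.
    max_results: Maximum number of results to return (default 5).

Returns:
    A dict with 'results' containing title, url, content, and score for each result.
"""

if __name__ == "__main__":
    print(f"capturing_tavily_search docstring: ~{count_tokens(SEARCH_DOC)} tokens")
```

This counts the docstring text only. Whatever the SDK adds around it (tool name, parameter schema) is not included, so the result may come out lower than the figures in the table.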
220 tokens does not sound like much. But remember -- each sub-query spawns its own researcher agent. A complex query decomposes into 4 sub-queries, so that is 4 agents x 220 = 880 tokens spent on tool descriptions alone. And in a production system with 30 tools, it becomes 2,400+ tokens per agent.
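The multiplication is easy to sanity-check. A small sketch, assuming every researcher agent receives the same two tool descriptions; `description_overhead` is a hypothetical helper name:

```python
def description_overhead(tokens_per_agent: int, num_agents: int) -> int:
    """Total tokens spent on tool descriptions across all researcher agents."""
    return tokens_per_agent * num_agents


# Measured per-tool costs from the table: search (~120) + credibility (~100).
PER_AGENT = 120 + 100

if __name__ == "__main__":
    for sub_queries in (1, 2, 4):
        total = description_overhead(PER_AGENT, sub_queries)
        print(f"{sub_queries} sub-queries -> {total} tokens of tool descriptions")
```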
The question is: does every query actually need every tool?
Step 1: Classify Query Intent
Not all queries are created equal. "Explain how a neural network works" does not need web search at all -- the LLM already knows the answer. "Compare React vs Vue in 2026" needs both search and credibility scoring. We capture this with a QueryIntent enum:
```python
from enum import Enum


class QueryIntent(Enum):
    FACTUAL = "factual"                # Needs search for facts
    COMPARATIVE = "comparative"        # Needs search + credibility scoring
    CONCEPTUAL = "conceptual"          # Can answer from LLM knowledge
    CURRENT_EVENTS = "current_events"  # Needs heavy search (recent info)
```
The classifier uses keyword heuristics -- no LLM call, no latency, no token cost:
```python
_COMPARATIVE_KEYWORDS = {
    "compare", "vs", "versus", "difference", "differences",
    "pros and cons", "advantages", "disadvantages", "trade-off",
    "better", "best", "which one", "ranking",
}

_CURRENT_KEYWORDS = {
    "latest", "recent", "2025", "2026", "new", "trending", "trends",
    "current", "today", "this year", "this month", "update",
}

_CONCEPTUAL_KEYWORDS = {
    "explain", "what is", "define", "concept", "theory",
    "how does", "why does", "meaning of", "introduction to",
}


def classify_intent(query: str) -> QueryIntent:
    """Classify query intent using keyword heuristics."""
    query_lower = query.lower()

    # Checked in priority order: comparative, then current events, then conceptual.
    if any(kw in query_lower for kw in _COMPARATIVE_KEYWORDS):
        return QueryIntent.COMPARATIVE

    if any(kw in query_lower for kw in _CURRENT_KEYWORDS):
        return QueryIntent.CURRENT_EVENTS

    if any(kw in query_lower for kw in _CONCEPTUAL_KEYWORDS):
        # Long "conceptual" questions usually benefit from search.
        if len(query.split()) > 15:
            return QueryIntent.FACTUAL
        return QueryIntent.CONCEPTUAL

    return QueryIntent.FACTUAL
```
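Notice the guard on conceptual queries: if the question is long (over 15 words), it is likely complex enough to benefit from search, so we bump it to FACTUAL. "Explain quantum computing" stays conceptual; "Explain how the qubit designs used in Google's Willow chip and IBM's Condor chip affect error rates in practice" (19 words) becomes factual. Order matters too: comparative and current-events keywords are checked first, so a query containing "differences" or "latest" is classified as COMPARATIVE or CURRENT_EVENTS before the conceptual branch is ever reached.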
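One caveat of the `kw in query_lower` test is that it matches substrings, so "new" also fires on "renewable". If that becomes a problem, a word-boundary variant is one possible hardening. The sketch below is an alternative, not the classifier used in this project; `compile_keywords` is a hypothetical helper and the keyword set is a shortened example.

```python
import re


def compile_keywords(keywords: set[str]) -> re.Pattern:
    """Build one case-insensitive regex that matches any keyword as a whole word or phrase."""
    # Longest first so multi-word phrases win over shorter alternatives.
    ordered = sorted(keywords, key=len, reverse=True)
    alternation = "|".join(re.escape(kw) for kw in ordered)
    return re.compile(rf"\b(?:{alternation})\b", re.IGNORECASE)


CURRENT = compile_keywords({"latest", "new", "today", "this year"})

if __name__ == "__main__":
    for query in ("renewable energy basics", "what is new in Python", "sales this year"):
        naive = "new" in query.lower()               # substring test
        strict = CURRENT.search(query) is not None   # whole-word test
        print(f"{query!r}: substring={naive}, word-boundary={strict}")
```

The trade-off is that inflected forms stop matching for free: with word boundaries, "updated" no longer triggers "update", so such variants would have to be listed explicitly.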
Step 2: Select Tools per Intent
With intent classified, we map intent x complexity combinations to tool sets:
```python
def select_tools(query: str, complexity: QueryComplexity) -> ToolSelection:
    intent = classify_intent(query)
    all_tools = ["capturing_tavily_search", "capturing_score_credibility"]

    if complexity == QueryComplexity.SIMPLE and intent == QueryIntent.CONCEPTUAL:
        selected = []
        excluded = all_tools
        reasoning = "Simple conceptual query -- LLM can answer directly without search"
        savings = sum(_TOOL_TOKEN_COSTS.values())  # 220 tokens saved

    elif intent == QueryIntent.FACTUAL:
        selected = ["capturing_tavily_search"]
        excluded = ["capturing_score_credibility"]
        reasoning = "Factual query -- search needed, credibility scoring skipped"
        savings = _TOOL_TOKEN_COSTS["capturing_score_credibility"]  # 100 tokens saved

    elif intent in (QueryIntent.COMPARATIVE, QueryIntent.CURRENT_EVENTS):
        selected = all_tools
        excluded = []
        reasoning = f"{intent.value} query -- all tools needed for thorough research"
        savings = 0

    else:
        # Non-simple conceptual queries end up here. Without this branch,
        # `selected` would be unbound at the return below. Keeping all tools
        # is an assumed safe default.
        selected = all_tools
        excluded = []
        reasoning = "Conceptual but not simple -- keeping all tools as a safe default"
        savings = 0

    return ToolSelection(
        intent=intent,
        selected=selected,
        excluded=excluded,
        reasoning=reasoning,
        estimated_token_savings=savings,
    )
```
The decision matrix:
| Complexity | Intent | Tools Selected | Savings per Agent |
|---|---|---|---|
| SIMPLE | CONCEPTUAL | (none) | 220 tokens |
| any | FACTUAL | tavily_search only | 100 tokens |
| any | COMPARATIVE | all tools | 0 tokens |
| any | CURRENT_EVENTS | all tools | 0 tokens |
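`select_tools` relies on a `_TOOL_TOKEN_COSTS` lookup that is not shown in this post. A minimal sketch of what it might look like, assuming a plain dict filled with the measured values from the earlier table; `savings_for` is a hypothetical helper that sums the cost of whatever was excluded, which reproduces the savings column of the matrix:

```python
# Assumed shape: tool name -> measured description cost in tokens.
_TOOL_TOKEN_COSTS = {
    "capturing_tavily_search": 120,
    "capturing_score_credibility": 100,
}


def savings_for(excluded: list[str]) -> int:
    """Tokens saved per agent by leaving out the given tools (hypothetical helper)."""
    return sum(_TOOL_TOKEN_COSTS[name] for name in excluded)


if __name__ == "__main__":
    both = ["capturing_tavily_search", "capturing_score_credibility"]
    print("SIMPLE + CONCEPTUAL:", savings_for(both))
    print("FACTUAL:", savings_for(["capturing_score_credibility"]))
    print("COMPARATIVE / CURRENT_EVENTS:", savings_for([]))
```

Deriving the savings from the excluded list, rather than hard-coding them per branch, keeps the numbers consistent if a tool's description changes.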
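The biggest win is the SIMPLE + CONCEPTUAL combination. The selector returns an empty tool list, so in its accounting the agent carries zero tool descriptions and saves 220 tokens. Note that the orchestrator shown in the Integration section re-adds search as a safety fallback when the list is empty, so the realized saving there is 100 tokens per agent, and Tavily API calls are avoided only as long as the agent chooses not to use the fallback.
The ToolSelection Dataclass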
The result carries everything the orchestrator and X-ray panel need:
```python
from dataclasses import dataclass


@dataclass
class ToolSelection:
    """Result of dynamic tool selection."""

    intent: QueryIntent
    selected: list[str]           # Tool names to include
    excluded: list[str]           # Tool names excluded (for X-ray)
    reasoning: str
    estimated_token_savings: int  # Tokens saved from excluded tool descriptions
```
The excluded field is not just for logging -- it powers the Context X-ray visualization (Blog 6) so you can see exactly which tools were dropped and why.
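To hand a `ToolSelection` to a UI, it has to be serialized at some point. The sketch below assumes the consumer accepts JSON. `to_payload` is a hypothetical helper, and the post does not describe the payload format the X-ray panel actually uses. The enum and dataclass are repeated only to keep the snippet self-contained.

```python
import json
from dataclasses import asdict, dataclass
from enum import Enum


class QueryIntent(Enum):
    FACTUAL = "factual"
    COMPARATIVE = "comparative"
    CONCEPTUAL = "conceptual"
    CURRENT_EVENTS = "current_events"


@dataclass
class ToolSelection:
    intent: QueryIntent
    selected: list[str]
    excluded: list[str]
    reasoning: str
    estimated_token_savings: int


def to_payload(selection: ToolSelection) -> dict:
    """Convert a ToolSelection into a JSON-serializable dict (hypothetical helper)."""
    payload = asdict(selection)
    # Enum members are not JSON-serializable, so send the string value instead.
    payload["intent"] = selection.intent.value
    return payload


if __name__ == "__main__":
    selection = ToolSelection(
        intent=QueryIntent.FACTUAL,
        selected=["capturing_tavily_search"],
        excluded=["capturing_score_credibility"],
        reasoning="Factual query -- search needed, credibility scoring skipped",
        estimated_token_savings=100,
    )
    print(json.dumps(to_payload(selection), indent=2))
```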
Integration: The Orchestrator
In the orchestrator, _research_sub_query accepts a tool_selection parameter and constructs each agent with only the selected tools:
```python
async def _research_sub_query(
    sub_query: str,
    index: int,
    complexity=None,
    tool_selection: ToolSelection | None = None,
) -> ResearchResult:
    captured_sources: list[dict] = []
    search_tool, cred_tool = _create_source_capturing_tools(
        captured_sources, complexity=complexity
    )

    # Dynamic tool selection -- only include tools the selector chose
    tools = []
    if not tool_selection or "capturing_tavily_search" in tool_selection.selected:
        tools.append(search_tool)
    if not tool_selection or "capturing_score_credibility" in tool_selection.selected:
        tools.append(cred_tool)

    # Minimum safety: always allow search as fallback
    if not tools:
        tools = [search_tool]

    agent = Agent(
        model=_create_model(),
        system_prompt=prompt,
        tools=tools,  # Only selected tools here
        callback_handler=None,
    )
```
The key line is tools=tools -- instead of always passing [search_tool, cred_tool], we pass a filtered list. When tool_selection.selected is empty (conceptual query), we still include search as a safety fallback, but the agent will typically answer directly without calling it.
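The filtering logic can be isolated into a pure function, which makes the fallback behaviour easy to test without constructing an agent. This is a sketch under assumptions: `filter_tools` and the name-to-tool `registry` dict are hypothetical, and plain strings stand in for the real tool objects.

```python
from typing import Optional


def filter_tools(
    registry: dict[str, object],
    selected: Optional[list[str]],
    fallback: str,
) -> list[object]:
    """Return the tool objects to attach to an agent."""
    if selected is None:
        # No selection was made: behave like the naive approach.
        return list(registry.values())
    tools = [tool for name, tool in registry.items() if name in selected]
    if not tools:
        # Safety net: never leave the agent without the fallback tool.
        tools = [registry[fallback]]
    return tools


if __name__ == "__main__":
    registry = {
        "capturing_tavily_search": "<search tool>",
        "capturing_score_credibility": "<credibility tool>",
    }
    for selection in (None, [], ["capturing_tavily_search"]):
        print(selection, "->", filter_tools(registry, selection, "capturing_tavily_search"))
```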
The orchestrator records the selection as a pipeline stage:
```python
tool_selection = select_tools(query, context_engine.complexity.level)

event = context_engine.record_stage(
    name="tool_selection",
    tokens=sum(
        120 if t == "capturing_tavily_search" else 100
        for t in tool_selection.selected
    ),
    naive_tokens=220,  # All tools always included in naive approach
    items=[
        {"label": "Intent", "value": tool_selection.intent.value},
        {"label": "Selected", "value": ", ".join(tool_selection.selected) or "(none)"},
        {"label": "Excluded", "value": ", ".join(tool_selection.excluded) or "(none)"},
        {"label": "Reasoning", "value": tool_selection.reasoning},
        {"label": "Token savings", "value": f"{tool_selection.estimated_token_savings} per agent"},
    ],
)
```
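This event feeds directly into the Context X-ray panel, showing the intent classification, which tools were selected and excluded, and how many tokens the selector estimates were saved. Because tokens is computed from tool_selection.selected, the recorded figure for a simple conceptual query is 0 even though the fallback search tool (about 120 tokens) is still attached to the agent.
Real-World Impact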
Here is what dynamic tool selection looks like in practice across different query types:
| Query | Intent | Tools | Savings |
|---|---|---|---|
| "Explain backpropagation" | CONCEPTUAL | (none) | 220 tokens/agent |
| "GDP of Japan" | FACTUAL | search | 100 tokens/agent |
| "Compare AWS Lambda vs ECS" | COMPARATIVE | all | 0 |
| "Latest AI agent frameworks 2026" | CURRENT_EVENTS | all | 0 |
For a simple conceptual query with 1 sub-query, we save 220 tokens. For a moderate factual query with 2 sub-queries, we save 200 tokens (100 x 2 agents). The savings compound across the pipeline.
The classifier runs in microseconds -- it is just string matching against keyword sets. No LLM call, no API call, no latency. It is the cheapest optimization in the entire Context Engine, and one of the most effective on a per-query basis.
Design Trade-offs
Why keyword heuristics instead of an LLM classifier? Using an LLM to classify intent would cost tokens to save tokens -- a circular problem. The keyword approach is fast, deterministic, and free. It misclassifies occasionally (a query about "the latest theory of everything" triggers CURRENT_EVENTS when it is really conceptual), but that direction of error is safe: including more tools than needed wastes tokens but never breaks correctness. The opposite error, where a short factual question such as "What is the capital of Australia" matches "what is" and lands in CONCEPTUAL, is covered by the orchestrator's search fallback.
Why keep search as a fallback even for conceptual queries? Because LLMs can be wrong. If the agent decides it needs to verify a fact, we want search available. The fallback adds 120 tokens to conceptual queries but prevents the rare case where the agent hallucinates without the ability to ground itself.
What's Next
Tool selection reduces token waste at the description level. But what about the data itself? When 4 parallel researchers return their findings, the same source URLs and even identical paragraphs show up across multiple results. In Blog 5: Source & Findings Deduplication, we build a deduplication layer that uses URL matching and paragraph-level content hashing to prune redundant information before it reaches the synthesis prompt.
All code is open source: github.com/MinhQuanBuiSco/context-engine