Context Engine Series, Blog 4: Dynamic Tool Selection

Every tool you give an AI agent costs tokens. Not just when the agent uses the tool -- the @tool docstring, parameter descriptions, and return type all get injected into the system prompt on every single invocation. With 30 tools, that is roughly 2,400 tokens of context consumed before the model even reads the user's question.

Most queries need one or two tools. "Explain what a transformer is" needs zero. "What happened in the stock market today" needs a search tool. Yet the naive approach packs every tool description into the prompt regardless, wasting context window space that could hold actual research findings.

In this post, we build a dynamic tool selector that classifies query intent, maps it to a minimal tool set, and cuts tool-description tokens by up to 100% for simple conceptual queries (about 45% once the orchestrator's search fallback is counted).

The Context Engine Series

Part	Title	Focus
1	Architecture & Vision	System design, 6 techniques, pipeline overview
2	Query Complexity & Result Compression	Rule-based classifier, key sentence extraction
3	Semantic Memory & Research Caching	Jaccard similarity, local JSON cache
4	Dynamic Tool Selection (this post)	Intent classification, tool loadout
5	Source & Findings Deduplication	URL dedup, paragraph-level content hashing
6	Context X-ray Visualization	Real-time token tracking, WebSocket events

The Problem: Tool Descriptions Are Expensive

When you define a tool with Strands SDK's @tool decorator, the entire docstring becomes part of the agent's system prompt. Here is what the model actually sees for our two research tools:

from strands import tool

@tool
def capturing_tavily_search(query: str, max_results: int = 5) -> dict:
    """Search the web for information on a given query using Tavily API.

    Args:
        query: The search query string.
        max_results: Maximum number of results to return (default 5).

    Returns:
        A dict with 'results' containing title, url, content, and score
        for each result.
    """
    ...

@tool
def capturing_score_credibility(url: str, domain: str, content_snippet: str) -> dict:
    """Score the credibility of a source based on its domain and content.

    Args:
        url: The full URL of the source.
        domain: The domain name (e.g., 'reuters.com').
        content_snippet: A snippet of the source content for analysis.

    Returns:
        A dict with credibility score (0-100), tier, and reasoning.
    """
    ...

We measured the token cost of each description using tiktoken:

Tool	Tokens
`capturing_tavily_search`	~120
`capturing_score_credibility`	~100
Total (both tools)	~220
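A measurement like this can be approximated with the sketch below. It assumes tiktoken's cl100k_base encoding as a stand-in tokenizer (the model behind the agent may tokenize differently, so treat the counts as estimates) and falls back to a rough four-characters-per-token heuristic when tiktoken is not available. The helper names and the abbreviated description strings are illustrative, not code from the project, and because the real tool specification also carries the parameter schema, the counts will not match the table exactly.

```python
from typing import Callable


def make_token_counter() -> Callable[[str], int]:
    """Return a function that estimates the token count of a string."""
    try:
        import tiktoken  # optional dependency

        encoding = tiktoken.get_encoding("cl100k_base")
        return lambda text: len(encoding.encode(text))
    except Exception:
        # Assumption: ~4 characters per token is a rough estimate for English.
        return lambda text: max(1, len(text) // 4)


def tool_description_costs(descriptions: dict[str, str]) -> dict[str, int]:
    """Map each tool name to the estimated token cost of its description."""
    count = make_token_counter()
    return {name: count(text) for name, text in descriptions.items()}


# Abbreviated docstrings from the two tools above (illustrative only).
DESCRIPTIONS = {
    "capturing_tavily_search": (
        "Search the web for information on a given query using Tavily API. "
        "Args: query: The search query string. "
        "max_results: Maximum number of results to return (default 5)."
    ),
    "capturing_score_credibility": (
        "Score the credibility of a source based on its domain and content. "
        "Args: url: The full URL of the source. domain: The domain name."
    ),
}

costs = tool_description_costs(DESCRIPTIONS)
total = sum(costs.values())
print(costs, total)
```

Keeping the counter behind a small factory makes it easy to swap in the tokenizer that matches the deployed model.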

220 tokens does not sound like much. But remember -- each sub-query spawns its own researcher agent. A complex query decomposes into 4 sub-queries, so that is 4 agents x 220 = 880 tokens spent on tool descriptions alone. And in a production system with 30 tools, it becomes 2,400+ tokens per agent.
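The arithmetic above is easy to reproduce. The sketch below uses a hypothetical helper, description_overhead, and the figures quoted in the text; the 9,600-token result for 30 tools across 4 agents is derived here from those figures and is not a measurement.

```python
def description_overhead(tokens_per_agent: int, num_agents: int) -> int:
    """Total tokens spent on tool descriptions across all researcher agents."""
    if tokens_per_agent < 0 or num_agents < 0:
        raise ValueError("token and agent counts must be non-negative")
    return tokens_per_agent * num_agents


# Two tools at ~120 and ~100 tokens, one agent per sub-query.
two_tools = 120 + 100
complex_query = description_overhead(two_tools, num_agents=4)

# 30 tools at roughly 2,400 tokens per agent, same 4-way decomposition.
thirty_tools = description_overhead(2_400, num_agents=4)

print(complex_query, thirty_tools)
```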

The question is: does every query actually need every tool?

Step 1: Classify Query Intent

Not all queries are created equal. "Explain how a neural network works" does not need web search at all -- the LLM already knows the answer. "Compare React vs Vue in 2026" needs both search and credibility scoring. We capture this with a QueryIntent enum:

from enum import Enum

class QueryIntent(Enum):
    FACTUAL = "factual"          # Needs search for facts
    COMPARATIVE = "comparative"  # Needs search + credibility scoring
    CONCEPTUAL = "conceptual"    # Can answer from LLM knowledge
    CURRENT_EVENTS = "current_events"  # Needs heavy search (recent info)

The classifier uses keyword heuristics -- no LLM call, negligible latency, no token cost:

_COMPARATIVE_KEYWORDS = {
    "compare", "vs", "versus", "difference", "differences",
    "pros and cons", "advantages", "disadvantages", "trade-off",
    "better", "best", "which one", "ranking",
}
_CURRENT_KEYWORDS = {
    "latest", "recent", "2025", "2026", "new", "trending", "trends",
    "current", "today", "this year", "this month", "update",
}
_CONCEPTUAL_KEYWORDS = {
    "explain", "what is", "define", "concept", "theory",
    "how does", "why does", "meaning of", "introduction to",
}

def classify_intent(query: str) -> QueryIntent:
    """Classify query intent using keyword heuristics."""
    query_lower = query.lower()

    if any(kw in query_lower for kw in _COMPARATIVE_KEYWORDS):
        return QueryIntent.COMPARATIVE

    if any(kw in query_lower for kw in _CURRENT_KEYWORDS):
        return QueryIntent.CURRENT_EVENTS

    if any(kw in query_lower for kw in _CONCEPTUAL_KEYWORDS):
        if len(query.split()) > 15:
            return QueryIntent.FACTUAL
        return QueryIntent.CONCEPTUAL

    return QueryIntent.FACTUAL

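Notice the guard on conceptual queries: if the question is long (over 15 words), it is likely complex enough to benefit from search, so we bump it to FACTUAL. "Explain quantum computing" stays conceptual; "Explain how the attention mechanism in a transformer lets each token weigh every other token in the input sequence" (19 words) becomes factual. The guard only applies to queries that reach the conceptual check: a query containing "differences" or "latest" is caught by the comparative or current-events check first, whatever its length.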
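One caveat with the kw in query_lower test: it matches substrings, so "new" also fires on "renewable" and "vs" on "devs". A minimal sketch of a whole-word variant follows, assuming the same precedence and abbreviated keyword sets. The _compile helper is a hypothetical addition, not code from the repository, and the trade-off is that inflected forms such as "explains" must then be listed explicitly.

```python
import re
from enum import Enum


class QueryIntent(Enum):
    FACTUAL = "factual"
    COMPARATIVE = "comparative"
    CONCEPTUAL = "conceptual"
    CURRENT_EVENTS = "current_events"


def _compile(keywords: set[str]) -> re.Pattern:
    """Build one regex that matches any keyword as a whole word or phrase."""
    # Longest first so multi-word phrases are tried before shorter keywords.
    ordered = sorted(keywords, key=len, reverse=True)
    alternatives = "|".join(re.escape(kw) for kw in ordered)
    # Lookarounds reject matches that sit inside a longer word.
    return re.compile(rf"(?<!\w)(?:{alternatives})(?!\w)")


# Abbreviated keyword sets (assumption: same spirit as the originals).
_COMPARATIVE = _compile({"compare", "vs", "versus", "difference", "differences",
                         "pros and cons", "trade-off", "better", "best"})
_CURRENT = _compile({"latest", "recent", "2025", "2026", "new", "today", "update"})
_CONCEPTUAL = _compile({"explain", "what is", "define", "how does", "why does"})


def classify_intent(query: str) -> QueryIntent:
    """Same precedence as the original, but keywords must match whole words."""
    text = query.lower()
    if _COMPARATIVE.search(text):
        return QueryIntent.COMPARATIVE
    if _CURRENT.search(text):
        return QueryIntent.CURRENT_EVENTS
    if _CONCEPTUAL.search(text):
        # Long conceptual questions are bumped to FACTUAL, as in the original.
        if len(query.split()) > 15:
            return QueryIntent.FACTUAL
        return QueryIntent.CONCEPTUAL
    return QueryIntent.FACTUAL


print(classify_intent("Explain how renewable energy works"))
```

With plain substring matching, the query in the last line would hit "new" inside "renewable" and be routed to CURRENT_EVENTS.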

Step 2: Select Tools per Intent

With intent classified, we map intent x complexity combinations to tool sets:

# Per-tool description costs, from the tiktoken measurements above
_TOOL_TOKEN_COSTS = {
    "capturing_tavily_search": 120,
    "capturing_score_credibility": 100,
}

def select_tools(query: str, complexity: QueryComplexity) -> ToolSelection:
    intent = classify_intent(query)
    all_tools = ["capturing_tavily_search", "capturing_score_credibility"]

    if complexity == QueryComplexity.SIMPLE and intent == QueryIntent.CONCEPTUAL:
        selected = []
        excluded = list(all_tools)
        reasoning = "Simple conceptual query -- LLM can answer directly without search"
        savings = sum(_TOOL_TOKEN_COSTS.values())  # 220 tokens saved

    elif intent == QueryIntent.FACTUAL:
        selected = ["capturing_tavily_search"]
        excluded = ["capturing_score_credibility"]
        reasoning = "Factual query -- search needed, credibility scoring skipped"
        savings = _TOOL_TOKEN_COSTS["capturing_score_credibility"]  # 100 tokens saved

    else:
        # COMPARATIVE and CURRENT_EVENTS, plus (as an assumed safe default)
        # CONCEPTUAL queries that are not SIMPLE: keep every tool.
        selected = list(all_tools)
        excluded = []
        reasoning = f"{intent.value} query -- all tools needed for thorough research"
        savings = 0

    return ToolSelection(
        intent=intent,
        selected=selected,
        excluded=excluded,
        reasoning=reasoning,
        estimated_token_savings=savings,
    )

The final else branch matters: a conceptual query that is not SIMPLE matches neither of the first two branches, and without a catch-all the function would reach the return statement with selected unbound. Routing that case to all tools is an assumed safe default.

The decision matrix:

Complexity	Intent	Tools Selected	Savings per Agent
SIMPLE	CONCEPTUAL	(none)	220 tokens
not SIMPLE	CONCEPTUAL	all tools	0 tokens
any	FACTUAL	tavily_search only	100 tokens
any	COMPARATIVE	all tools	0 tokens
any	CURRENT_EVENTS	all tools	0 tokens

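The biggest win on paper is the SIMPLE + CONCEPTUAL combination: the selector returns an empty tool list and reports 220 tokens saved per agent. The orchestrator shown later re-adds search as a safety fallback when the list is empty, so the realized saving is 100 tokens per agent (only the credibility tool's description is dropped). The agent can still answer from its training data and will typically skip the Tavily API call.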
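The decision matrix above can also be written as data rather than as an if/elif chain. The sketch below is one way to do that, under two assumptions: QueryComplexity is a stand-in enum whose member names beyond SIMPLE are guesses, and conceptual queries that are not SIMPLE get all tools. Deriving excluded and savings from the selected set keeps the three values from drifting apart.

```python
from enum import Enum


class QueryComplexity(Enum):  # stand-in; only SIMPLE is named in the post
    SIMPLE = "simple"
    MODERATE = "moderate"
    COMPLEX = "complex"


class QueryIntent(Enum):
    FACTUAL = "factual"
    COMPARATIVE = "comparative"
    CONCEPTUAL = "conceptual"
    CURRENT_EVENTS = "current_events"


TOOL_TOKEN_COSTS = {
    "capturing_tavily_search": 120,
    "capturing_score_credibility": 100,
}
ALL_TOOLS = tuple(TOOL_TOKEN_COSTS)

# Intent -> tools to keep. CONCEPTUAL is handled separately because it
# also depends on complexity.
TOOLS_BY_INTENT = {
    QueryIntent.FACTUAL: ("capturing_tavily_search",),
    QueryIntent.COMPARATIVE: ALL_TOOLS,
    QueryIntent.CURRENT_EVENTS: ALL_TOOLS,
}


def pick_tools(intent: QueryIntent, complexity: QueryComplexity):
    """Return (selected, excluded, estimated_savings) for one agent."""
    if intent is QueryIntent.CONCEPTUAL:
        # Assumption: non-simple conceptual queries fall back to all tools.
        selected = () if complexity is QueryComplexity.SIMPLE else ALL_TOOLS
    else:
        selected = TOOLS_BY_INTENT[intent]
    excluded = tuple(t for t in ALL_TOOLS if t not in selected)
    savings = sum(TOOL_TOKEN_COSTS[t] for t in excluded)
    return selected, excluded, savings


print(pick_tools(QueryIntent.FACTUAL, QueryComplexity.MODERATE))
```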

The ToolSelection Dataclass

The result carries everything the orchestrator and X-ray panel need:

from dataclasses import dataclass

@dataclass
class ToolSelection:
    """Result of dynamic tool selection."""

    intent: QueryIntent
    selected: list[str]     # Tool names to include
    excluded: list[str]     # Tool names excluded (for X-ray)
    reasoning: str
    estimated_token_savings: int  # Tokens saved from excluded tool descriptions

The excluded field is not just for logging -- it powers the Context X-ray visualization (Blog 6) so you can see exactly which tools were dropped and why.
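To hand a selection to a UI panel it has to be serialized at some point. The sketch below shows one way to do that with the standard library; the to_event_payload helper and the JSON shape are assumptions for illustration, not the format the X-ray panel actually consumes.

```python
import json
from dataclasses import asdict, dataclass
from enum import Enum


class QueryIntent(Enum):
    FACTUAL = "factual"
    COMPARATIVE = "comparative"
    CONCEPTUAL = "conceptual"
    CURRENT_EVENTS = "current_events"


@dataclass
class ToolSelection:
    intent: QueryIntent
    selected: list[str]
    excluded: list[str]
    reasoning: str
    estimated_token_savings: int


def to_event_payload(selection: ToolSelection) -> str:
    """Serialize a ToolSelection to JSON (hypothetical payload shape)."""
    payload = asdict(selection)
    # Enum members are not JSON-serializable, so send the string value.
    payload["intent"] = selection.intent.value
    return json.dumps(payload)


example = ToolSelection(
    intent=QueryIntent.FACTUAL,
    selected=["capturing_tavily_search"],
    excluded=["capturing_score_credibility"],
    reasoning="Factual query -- search needed, credibility scoring skipped",
    estimated_token_savings=100,
)
print(to_event_payload(example))
```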

Integration: The Orchestrator

In the orchestrator, _research_sub_query accepts a tool_selection parameter and constructs each agent with only the selected tools:

async def _research_sub_query(
    sub_query: str,
    index: int,
    complexity=None,
    tool_selection: ToolSelection | None = None,
) -> ResearchResult:
    captured_sources: list[dict] = []
    search_tool, cred_tool = _create_source_capturing_tools(
        captured_sources, complexity=complexity
    )

    # Dynamic tool selection -- only include tools the selector chose
    tools = []
    if not tool_selection or "capturing_tavily_search" in tool_selection.selected:
        tools.append(search_tool)
    if not tool_selection or "capturing_score_credibility" in tool_selection.selected:
        tools.append(cred_tool)

    # Minimum safety: always allow search as fallback
    if not tools:
        tools = [search_tool]

    agent = Agent(
        model=_create_model(),
        system_prompt=prompt,
        tools=tools,           # Only selected tools here
        callback_handler=None,
    )

The key line is tools=tools -- instead of always passing [search_tool, cred_tool], we pass a filtered list. When tool_selection.selected is empty (conceptual query), we still include search as a safety fallback, but the agent will typically answer directly without calling it.
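With two tools, a pair of if statements is fine. As the tool count grows, the same filtering can be driven by a name-to-tool registry. The sketch below is a hypothetical generalization: build_tool_list and the plain functions standing in for the decorated tools are not part of the project, and no Strands APIs are used.

```python
from typing import Callable, Optional


def build_tool_list(
    registry: dict[str, Callable],
    selected: Optional[list[str]],
    fallback: str,
) -> list[Callable]:
    """Return the tool callables for the selected names.

    selected=None means no selector ran, so every registered tool is kept.
    An empty selection yields only the fallback tool.
    """
    if selected is None:
        return list(registry.values())
    unknown = [name for name in selected if name not in registry]
    if unknown:
        raise KeyError(f"unknown tool names: {unknown}")
    tools = [registry[name] for name in selected]
    return tools or [registry[fallback]]


# Stand-ins for the real decorated tools.
def search_tool(query: str) -> dict:
    return {"results": []}


def cred_tool(url: str) -> dict:
    return {"score": 0}


REGISTRY = {
    "capturing_tavily_search": search_tool,
    "capturing_score_credibility": cred_tool,
}

tools = build_tool_list(REGISTRY, [], fallback="capturing_tavily_search")
print([t.__name__ for t in tools])
```

Raising on unknown names is a design choice in this sketch: a typo in the selector then fails loudly instead of silently dropping a tool.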

The orchestrator records the selection as a pipeline stage:

tool_selection = select_tools(query, context_engine.complexity.level)
event = context_engine.record_stage(
    name="tool_selection",
    tokens=sum(120 if t == "capturing_tavily_search" else 100
               for t in tool_selection.selected),
    naive_tokens=220,  # All tools always included in naive approach
    items=[
        {"label": "Intent", "value": tool_selection.intent.value},
        {"label": "Selected", "value": ", ".join(tool_selection.selected) or "(none)"},
        {"label": "Excluded", "value": ", ".join(tool_selection.excluded) or "(none)"},
        {"label": "Reasoning", "value": tool_selection.reasoning},
        {"label": "Token savings", "value": f"{tool_selection.estimated_token_savings} per agent"},
    ],
)

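This event feeds directly into the Context X-ray panel, showing the intent classification, which tools were selected and excluded, and the estimated token savings. Because tokens is computed from tool_selection.selected, a simple conceptual query is recorded as 0 tokens even though the orchestrator's fallback still sends the search tool's ~120-token description.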
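The stage accounting can be derived from a cost map instead of hard-coded numbers, and it can optionally reflect the search fallback. The sketch below assumes the per-tool costs quoted earlier; stage_tokens is a hypothetical helper, not the project's record_stage API.

```python
from typing import Optional

TOOL_TOKEN_COSTS = {
    "capturing_tavily_search": 120,
    "capturing_score_credibility": 100,
}


def stage_tokens(selected: list[str], fallback: Optional[str] = None) -> tuple[int, int]:
    """Return (tokens sent, naive tokens) for the tool_selection stage.

    If selected is empty and a fallback tool is given, the fallback's
    description is counted, mirroring what the agent actually receives.
    """
    effective = list(selected)
    if not effective and fallback is not None:
        effective = [fallback]
    tokens = sum(TOOL_TOKEN_COSTS[name] for name in effective)
    naive_tokens = sum(TOOL_TOKEN_COSTS.values())
    return tokens, naive_tokens


# Selector's view vs. what is sent once the fallback is applied.
print(stage_tokens([]))
print(stage_tokens([], fallback="capturing_tavily_search"))
```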

Real-World Impact

Here is what dynamic tool selection looks like in practice across different query types:

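Query	Intent	Tools	Savings
"Explain backpropagation"	CONCEPTUAL	(none)	220 tokens/agent
"GDP of Japan in 2023"	FACTUAL	search	100 tokens/agent
"Compare AWS Lambda vs ECS"	COMPARATIVE	all	0
"Latest AI agent frameworks 2026"	CURRENT_EVENTS	all	0

Phrasing matters with a keyword classifier: "What is the GDP of Japan" contains the conceptual keyword "what is" and would be classified CONCEPTUAL, which is why the factual example avoids that phrase. The figures are the selector's estimates. For a simple conceptual query with 1 sub-query, it reports 220 tokens saved (100 realized once the orchestrator's search fallback is added back). For a moderate factual query with 2 sub-queries, we save 200 tokens (100 x 2 agents). The savings scale with the number of researcher agents.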

The classifier runs in microseconds -- it is just string matching against keyword sets, with no LLM call and no API call. It is the cheapest optimization in the entire Context Engine, and one of the most effective on a per-query basis.

Design Trade-offs

Why keyword heuristics instead of an LLM classifier? Using an LLM to classify intent would cost tokens to save tokens -- a circular problem. The keyword approach is fast, deterministic, and free. It misclassifies occasionally (a query about "the latest theory of everything" triggers CURRENT_EVENTS when it is really conceptual). Errors in that direction are safe: including more tools than needed wastes tokens but never breaks correctness. Errors in the other direction, where a query that needs search is labeled conceptual, are covered by the search fallback in the orchestrator.
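Because the heuristics are deterministic, known misses can be tracked with a small labeled set. The sketch below shows the idea with a toy classifier standing in for classify_intent; find_misclassifications, toy_classifier, and the labels are all hypothetical.

```python
from typing import Callable, Iterable


def find_misclassifications(
    classify: Callable[[str], str],
    labeled_queries: Iterable[tuple[str, str]],
) -> list[tuple[str, str, str]]:
    """Return (query, expected, actual) for each query the classifier gets wrong."""
    misses = []
    for query, expected in labeled_queries:
        actual = classify(query)
        if actual != expected:
            misses.append((query, expected, actual))
    return misses


def toy_classifier(query: str) -> str:
    """Stand-in with the same precedence idea: recency keywords win."""
    text = query.lower()
    if "latest" in text:
        return "current_events"
    if "explain" in text or "theory" in text:
        return "conceptual"
    return "factual"


LABELED = [
    ("Explain backpropagation", "conceptual"),
    ("Latest AI agent frameworks 2026", "current_events"),
    ("the latest theory of everything", "conceptual"),  # known miss from the text
]

misses = find_misclassifications(toy_classifier, LABELED)
for query, expected, actual in misses:
    print(f"{query!r}: expected {expected}, got {actual}")
```

Running a check like this whenever a keyword is added shows whether the change fixes one query at the cost of breaking another.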

Why keep search as a fallback even for conceptual queries? Because LLMs can be wrong. If the agent decides it needs to verify a fact, we want search available. The fallback adds 120 tokens to conceptual queries but prevents the rare case where the agent hallucinates without the ability to ground itself.

What's Next

Tool selection reduces token waste at the description level. But what about the data itself? When 4 parallel researchers return their findings, the same source URLs and even identical paragraphs show up across multiple results. In Blog 5: Source & Findings Deduplication, we build a deduplication layer that uses URL matching and paragraph-level content hashing to prune redundant information before it reaches the synthesis prompt.

All code is open source: github.com/MinhQuanBuiSco/context-engine