Tool Hints vs Model Size: What Actually Makes SLMs Call Tools Correctly
Enum constraints on tool descriptions improved a 1.5B model's accuracy more than upgrading to 7B. Findings from building MCP servers with Ollama.
By Krunal Sabnis
The Setup
We built two MCP servers — Google Calendar (6 tools) and Vault search (4 tools) — and wired them to a local 1.5B parameter model (qwen2.5:1.5b) via Ollama’s tool calling API.
The question: when a small model fails to call the right tool, is the fix a bigger model or better tool descriptions?
We tested both.
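The wiring itself is thin. A minimal sketch of the shape, assuming the `ollama` Python client's tool-calling interface; the handler bodies and the exact `ollama.chat` call (shown only in comments) are illustrative stand-ins for the real MCP-backed implementations:

```python
# Sketch: tool schemas plus a dispatcher for Ollama-style tool calls.
# Handler bodies are stand-ins for the real MCP server calls.

def list_events(account: str, date: str) -> str:
    return f"events for {account} on {date}"  # placeholder result

def list_accounts() -> str:
    return "personal, zeoxia, greenpill, aster, neurelay, persona"

HANDLERS = {"list_events": list_events, "list_accounts": list_accounts}

TOOLS = [{
    "type": "function",
    "function": {
        "name": "list_events",
        "description": "Get calendar events. Call this directly when "
                       "user mentions an account.",
        "parameters": {
            "type": "object",
            "properties": {
                "account": {"type": "string",
                            "enum": ["personal", "zeoxia", "greenpill",
                                     "aster", "neurelay", "persona"]},
                "date": {"type": "string", "description": "YYYY-MM-DD"},
            },
            "required": ["account", "date"],
        },
    },
}]

def dispatch(tool_call: dict) -> str:
    """Route one tool call from the model's response to its handler."""
    fn = tool_call["function"]
    return HANDLERS[fn["name"]](**fn["arguments"])

# With the ollama client, the loop then looks roughly like:
#   resp = ollama.chat(model="qwen2.5:1.5b", messages=msgs, tools=TOOLS)
#   for call in resp["message"].get("tool_calls", []):
#       msgs.append({"role": "tool", "content": dispatch(call)})
```

Everything the model sees about a tool lives in that `TOOLS` structure, which is why the descriptions matter so much.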
The Experiment
Task: “What’s on my zeoxia calendar today?”
The correct behavior: call list_events(account="zeoxia", date="2026-03-06").
Two tool configurations, same model, same prompt.
Version A: Minimal Descriptions
```json
{
  "name": "list_events",
  "description": "List calendar events for a specific date",
  "parameters": {
    "account": { "type": "string", "description": "Account name" },
    "date": { "type": "string", "description": "Date in YYYY-MM-DD format" }
  }
}
```
Result: Model didn’t call any tool. Responded: “Could you please provide me with the account name and the specific date?” — even though both were in the user’s message.
Version B: Aggressive Hints
```json
{
  "name": "list_events",
  "description": "Get calendar events. Known accounts: personal, zeoxia, greenpill, aster, neurelay, persona. Call this directly when user mentions an account.",
  "parameters": {
    "account": {
      "type": "string",
      "enum": ["personal", "zeoxia", "greenpill", "aster", "neurelay", "persona"]
    },
    "date": {
      "type": "string",
      "description": "YYYY-MM-DD. Today is 2026-03-06."
    }
  }
}
```
Result: Correct tool call on first turn. list_events("zeoxia", "2026-03-06").
What Changed
Three things, ranked by impact:
1. Enum constraints (highest impact)
"enum": ["personal", "zeoxia", "greenpill", "aster", "neurelay", "persona"]
Without this, the 1.5B model didn’t recognise “zeoxia” as a valid account name — it’s not a common English word, so the model had no confidence to use it as a parameter value. With enum, it’s no longer guessing. It’s pattern matching against a closed set.
This is the single biggest lever. A model doesn’t need to “understand” your domain if you enumerate the valid values.
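Enums constrain what the model should emit, but a small model can still produce an off-list value, so it's worth checking arguments server-side before executing the call. A sketch of that guard, assuming a plain JSON-Schema-style parameter dict; `check_enums` and `ACCOUNT_SCHEMA` are names invented here for illustration:

```python
def check_enums(schema: dict, args: dict) -> list[str]:
    """Return a list of enum violations in a tool call's arguments.

    An empty list means every enum-constrained argument is valid;
    non-empty results can be fed back to the model as an error message.
    """
    errors = []
    for name, spec in schema.get("properties", {}).items():
        allowed = spec.get("enum")
        if allowed is not None and name in args and args[name] not in allowed:
            errors.append(f"{name}={args[name]!r} not in {allowed}")
    return errors

# The account parameter schema from the article's Version B tools.
ACCOUNT_SCHEMA = {
    "type": "object",
    "properties": {
        "account": {"type": "string",
                    "enum": ["personal", "zeoxia", "greenpill",
                             "aster", "neurelay", "persona"]},
        "date": {"type": "string"},
    },
}
```

Returning the violation text to the model as a tool error often lets it self-correct on the next turn, which is cheaper than a failed downstream API call.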
2. Negative guidance on adjacent tools
On list_accounts: "Only if user asks which accounts exist. NOT needed if they already named one."
Small models are cautious. Given two tools — list_accounts and list_events — a 1.5B model will often call the safer, information-gathering one first. Explicitly telling it when not to use a tool prevents this unnecessary round-trip.
3. Embedding runtime context in descriptions
"YYYY-MM-DD. Today is 2026-03-06."
The system prompt said today’s date, but the model still struggled to use it when filling tool parameters. Repeating the date inside the parameter description puts it right where the model needs it — at decision time, not three message turns ago.
The Results
| Configuration | Tool called | Correct? | Rounds to answer |
|---|---|---|---|
| Minimal hints | (none) | No | Gave up, asked user |
| Minimal hints (run 2) | list_events("", "2026-03-06") | No | Empty account |
| Minimal hints (run 3) | list_accounts() | Wrong tool | 2+ rounds, never finished |
| Aggressive hints | list_events("zeoxia", "2026-03-06") | Yes | 2 rounds (tool + answer) |
Same 1.5B model. Same prompt. Same tools. Only the descriptions changed.
The Mental Model
Think of tool descriptions as a contract surface between your system and the model. For large models (70B+, or cloud APIs like GPT-4/Claude), the contract can be loose — the model fills in gaps with reasoning. For small models, the contract must be tight:
Large model: loose description + reasoning = correct call
Small model: loose description + no reasoning = wrong call
Small model: tight description + no reasoning = correct call
The practical implication: tool hint quality is a multiplier on model capability. A well-described tool on a 1.5B model outperforms a poorly-described tool on a 7B model.
What About Just Using a Bigger Model?
We also tested qwen2.5:7b. It handles minimal descriptions correctly about 70% of the time. With aggressive hints, it’s near 100%.
The tradeoff:
- 7B model: 4GB VRAM, ~2s latency per call
- 1.5B model: 1GB VRAM, ~0.5s latency per call
If you can get 1.5B working reliably with better hints, you get 4x faster responses and can run on a Raspberry Pi. That’s worth writing better descriptions for.
Checklist: Writing Tool Descriptions for Small Models
- Use `enum` for every parameter with known values. Don't make the model guess. If there are 6 valid accounts, list all 6.
- Embed dynamic context in parameter descriptions. Today's date, the user's name, the current timezone: put it where it's used, not just in the system prompt.
- Add negative guidance to similar tools. If you have `search` and `list`, tell each one when not to be used. Small models can't infer this.
- Put the action verb first in descriptions. "Get calendar events", not "This tool can be used to retrieve calendar events from Google Calendar." Small context windows need density.
- Include an example in the description if the format is ambiguous. "Date in YYYY-MM-DD format, e.g. 2026-03-06" removes one more decision from the model.
- Reduce tool count. Every tool in the context competes for the model's attention. If you have 10 tools but the user will only need 3 in a given conversation, filter them.
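The last item can be as simple as a relevance filter run before each request. A rough sketch, assuming a hand-maintained keyword table (the keywords and tool names here are invented for illustration; a real system might score relevance with embeddings instead):

```python
# Hypothetical keyword map: which words in a user message suggest a tool.
TOOL_KEYWORDS = {
    "list_events": {"calendar", "event", "schedule", "meeting"},
    "list_accounts": {"account", "accounts"},
    "search_vault": {"vault", "note", "search", "find"},
}

def filter_tools(tools: list[dict], user_message: str,
                 max_tools: int = 3) -> list[dict]:
    """Keep only the tools whose keywords overlap the user's message,
    so fewer schemas compete for a small model's attention."""
    words = set(user_message.lower().split())
    scored = []
    for tool in tools:
        name = tool["function"]["name"]
        score = len(TOOL_KEYWORDS.get(name, set()) & words)
        if score:
            scored.append((score, tool))
    scored.sort(key=lambda pair: -pair[0])
    return [tool for _, tool in scored[:max_tools]]
```

Even a crude filter like this shrinks the prompt and removes distractor tools, which is exactly the failure mode behind run 3 in the results table.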
Implications for MCP Server Design
If you’re building MCP servers meant to be consumed by models of varying sizes:
- Don’t assume a large model. Your MCP server will be called by Claude, GPT-4, and also by someone’s local qwen2.5:1.5b. The descriptions need to work for all of them.
- Enums are not optional. They’re the difference between working and not working on small models.
- Tool descriptions are a UX surface. Treat them like API docs that will be read by a distracted junior developer with no domain context. That’s roughly what a 1.5B model is.
Conclusion
Model size and tool hint quality are both levers, but they’re not equal:
- Upgrading from 1.5B to 7B: moderate improvement, 4x the cost
- Upgrading tool descriptions: dramatic improvement, zero cost
Write your tool descriptions for the smallest model you want to support. Every model benefits. No model is harmed. The ten minutes you spend on better enum values and negative guidance will save you from debugging wrong tool calls forever.
Tested with: Ollama 0.6+, qwen2.5:1.5b, qwen2.5:7b, FastMCP 1.0. March 2026.