Tool Hints vs Model Size: What Actually Makes SLMs Call Tools Correctly
Enum constraints on tool descriptions improved a 1.5B model's accuracy more than upgrading to 7B. Findings from building MCP servers with Ollama.
By Krunal Sabnis
The Setup
We built two MCP servers — Google Calendar (6 tools) and Vault search (4 tools) — and wired them to a local 1.5B parameter model (qwen2.5:1.5b) via Ollama’s tool calling API.
The question: when a small model fails to call the right tool, is the fix a bigger model or better tool descriptions?
We tested both.
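The wiring itself is thin. A minimal sketch of the shape, assuming the `ollama` Python client's tool-calling interface; the handler bodies and the exact `ollama.chat` call (shown only in comments) are illustrative stand-ins for the real MCP-backed implementations:

```python
# Sketch: tool schemas plus a dispatcher for Ollama-style tool calls.
# Handler bodies are stand-ins for the real MCP server calls.

def list_events(account: str, date: str) -> str:
    return f"events for {account} on {date}"  # placeholder result

def list_accounts() -> str:
    return "personal, zeoxia, greenpill, aster, neurelay, persona"

HANDLERS = {"list_events": list_events, "list_accounts": list_accounts}

TOOLS = [{
    "type": "function",
    "function": {
        "name": "list_events",
        "description": "Get calendar events. Call this directly when "
                       "user mentions an account.",
        "parameters": {
            "type": "object",
            "properties": {
                "account": {"type": "string",
                            "enum": ["personal", "zeoxia", "greenpill",
                                     "aster", "neurelay", "persona"]},
                "date": {"type": "string", "description": "YYYY-MM-DD"},
            },
            "required": ["account", "date"],
        },
    },
}]

def dispatch(tool_call: dict) -> str:
    """Route one tool call from the model's response to its handler."""
    fn = tool_call["function"]
    return HANDLERS[fn["name"]](**fn["arguments"])

# With the ollama client, the loop then looks roughly like:
#   resp = ollama.chat(model="qwen2.5:1.5b", messages=msgs, tools=TOOLS)
#   for call in resp["message"].get("tool_calls", []):
#       msgs.append({"role": "tool", "content": dispatch(call)})
```

Everything the model sees about a tool lives in that `TOOLS` structure, which is why the descriptions matter so much.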
The Experiment
Task: “What’s on my zeoxia calendar today?”
The correct behavior: call list_events(account="zeoxia", date="2026-03-06").
Two tool configurations, same model, same prompt.
Version A: Minimal Descriptions
```json
{
  "name": "list_events",
  "description": "List calendar events for a specific date",
  "parameters": {
    "account": { "type": "string", "description": "Account name" },
    "date": { "type": "string", "description": "Date in YYYY-MM-DD format" }
  }
}
```
Result: Model didn’t call any tool. Responded: “Could you please provide me with the account name and the specific date?” — even though both were in the user’s message.
Version B: Aggressive Hints
```json
{
  "name": "list_events",
  "description": "Get calendar events. Known accounts: personal, zeoxia, greenpill, aster, neurelay, persona. Call this directly when user mentions an account.",
  "parameters": {
    "account": {
      "type": "string",
      "enum": ["personal", "zeoxia", "greenpill", "aster", "neurelay", "persona"]
    },
    "date": {
      "type": "string",
      "description": "YYYY-MM-DD. Today is 2026-03-06."
    }
  }
}
```
Result: Correct tool call on first turn. list_events("zeoxia", "2026-03-06").
What Changed
Three things, ranked by impact:
1. Enum constraints (highest impact)
"enum": ["personal", "zeoxia", "greenpill", "aster", "neurelay", "persona"]
Without this, the 1.5B model didn’t recognise “zeoxia” as a valid account name — it’s not a common English word, so the model had no confidence to use it as a parameter value. With enum, it’s no longer guessing. It’s pattern matching against a closed set.
This is the single biggest lever. A model doesn’t need to “understand” your domain if you enumerate the valid values.
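Enums constrain what the model should emit, but a small model can still produce an off-list value, so it's worth checking arguments server-side before executing the call. A sketch of that guard, assuming a plain JSON-Schema-style parameter dict; `check_enums` and `ACCOUNT_SCHEMA` are names invented here for illustration:

```python
def check_enums(schema: dict, args: dict) -> list[str]:
    """Return a list of enum violations in a tool call's arguments.

    An empty list means every enum-constrained argument is valid;
    non-empty results can be fed back to the model as an error message.
    """
    errors = []
    for name, spec in schema.get("properties", {}).items():
        allowed = spec.get("enum")
        if allowed is not None and name in args and args[name] not in allowed:
            errors.append(f"{name}={args[name]!r} not in {allowed}")
    return errors

# The account parameter schema from the article's Version B tools.
ACCOUNT_SCHEMA = {
    "type": "object",
    "properties": {
        "account": {"type": "string",
                    "enum": ["personal", "zeoxia", "greenpill",
                             "aster", "neurelay", "persona"]},
        "date": {"type": "string"},
    },
}
```

Returning the violation text to the model as a tool error often lets it self-correct on the next turn, which is cheaper than a failed downstream API call.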
2. Negative guidance on adjacent tools
On list_accounts: "Only if user asks which accounts exist. NOT needed if they already named one."
Small models are cautious. Given two tools — list_accounts and list_events — a 1.5B model will often call the safer, information-gathering one first. Explicitly telling it when not to use a tool prevents this unnecessary round-trip.
3. Embedding runtime context in descriptions
"YYYY-MM-DD. Today is 2026-03-06."
The system prompt said today’s date, but the model still struggled to use it when filling tool parameters. Repeating the date inside the parameter description puts it right where the model needs it — at decision time, not three message turns ago.
The Results
| Configuration | Tool called | Correct? | Rounds to answer |
|---|---|---|---|
| Minimal hints | (none) | No | Gave up, asked user |
| Minimal hints (run 2) | list_events("", "2026-03-06") | No | Empty account |
| Minimal hints (run 3) | list_accounts() | Wrong tool | 2+ rounds, never finished |
| Aggressive hints | list_events("zeoxia", "2026-03-06") | Yes | 2 rounds (tool + answer) |
Same 1.5B model. Same prompt. Same tools. Only the descriptions changed.
The Mental Model
Think of tool descriptions as a contract surface between your system and the model. For large models (70B+, or cloud APIs like GPT-4/Claude), the contract can be loose — the model fills in gaps with reasoning. For small models, the contract must be tight:
Large model: loose description + reasoning = correct call
Small model: loose description + no reasoning = wrong call
Small model: tight description + no reasoning = correct call
The practical implication: tool hint quality is a multiplier on model capability. A well-described tool on a 1.5B model outperforms a poorly-described tool on a 7B model.
What About Just Using a Bigger Model?
We also tested qwen2.5:7b. It handles minimal descriptions correctly about 70% of the time. With aggressive hints, it’s near 100%.
The tradeoff:
- 7B model: 4GB VRAM, ~2s latency per call
- 1.5B model: 1GB VRAM, ~0.5s latency per call
If you can get 1.5B working reliably with better hints, you get 4x faster responses and can run on a Raspberry Pi. That’s worth writing better descriptions for.
Checklist: Writing Tool Descriptions for Small Models
- Use `enum` for every parameter with known values. Don't make the model guess. If there are 6 valid accounts, list all 6.
- Embed dynamic context in parameter descriptions. Today's date, the user's name, the current timezone: put it where it's used, not just in the system prompt.
- Add negative guidance to similar tools. If you have `search` and `list`, tell each one when not to be used. Small models can't infer this.
- Put the action verb first in descriptions. "Get calendar events", not "This tool can be used to retrieve calendar events from Google Calendar." Small context windows need density.
- Include an example in the description if the format is ambiguous. "Date in YYYY-MM-DD format, e.g. 2026-03-06" removes one more decision from the model.
- Reduce tool count. Every tool in the context competes for the model's attention. If you have 10 tools but the user will only need 3 in a given conversation, filter them.
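The last item can be as simple as a relevance filter run before each request. A rough sketch, assuming a hand-maintained keyword table (the keywords and tool names here are invented for illustration; a real system might score relevance with embeddings instead):

```python
# Hypothetical keyword map: which words in a user message suggest a tool.
TOOL_KEYWORDS = {
    "list_events": {"calendar", "event", "schedule", "meeting"},
    "list_accounts": {"account", "accounts"},
    "search_vault": {"vault", "note", "search", "find"},
}

def filter_tools(tools: list[dict], user_message: str,
                 max_tools: int = 3) -> list[dict]:
    """Keep only the tools whose keywords overlap the user's message,
    so fewer schemas compete for a small model's attention."""
    words = set(user_message.lower().split())
    scored = []
    for tool in tools:
        name = tool["function"]["name"]
        score = len(TOOL_KEYWORDS.get(name, set()) & words)
        if score:
            scored.append((score, tool))
    scored.sort(key=lambda pair: -pair[0])
    return [tool for _, tool in scored[:max_tools]]
```

Even a crude filter like this shrinks the prompt and removes distractor tools, which is exactly the failure mode behind run 3 in the results table.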
Implications for MCP Server Design
If you’re building MCP servers meant to be consumed by models of varying sizes:
- Don’t assume a large model. Your MCP server will be called by Claude, GPT-4, and also by someone’s local qwen2.5:1.5b. The descriptions need to work for all of them.
- Enums are not optional. They’re the difference between working and not working on small models.
- Tool descriptions are a UX surface. Treat them like API docs that will be read by a distracted junior developer with no domain context. That’s roughly what a 1.5B model is.
Conclusion
Model size and tool hint quality are both levers, but they’re not equal:
- Upgrading from 1.5B to 7B: moderate improvement, 4x the cost
- Upgrading tool descriptions: dramatic improvement, zero cost
Write your tool descriptions for the smallest model you want to support. Every model benefits. No model is harmed. The ten minutes you spend on better enum values and negative guidance will save you from debugging wrong tool calls forever.
Tested with: Ollama 0.6+, qwen2.5:1.5b, qwen2.5:7b, FastMCP 1.0. March 2026.