
Small LLMs Beat Big Ones in Function Calling Benchmarks

Forget parameter counts. A puny 3.4GB model just schooled the heavyweights in function calling. Here's the data shaking up LLM deployment.


Key Takeaways

  • 3.4GB Qwen3.5 4B tops function calling at 97.5%, beating 25GB models.
  • Size doesn't predict accuracy; tuning for structured output wins.
  • Local deployment on 8GB GPUs now viable for agents — use llama.cpp + GBNF.

Ever wonder why your massive 70B model chokes on simple tool calls — while a featherweight 4B breezes through?

In a fresh 2026 benchmark by JD Hodges, which ran 13 quantized LLMs through LM Studio, function calling accuracy flipped every assumption. Qwen3.5 4B, clocking in at just 3.4GB of VRAM, hit 97.5% — nailing 39 out of 40 cases. That’s RTX 4060 territory, folks, with room for embeddings on the side.

A 25GB Nemotron 3 Nano 30B-A3B? 85%. Ouch.


Here’s the full leaderboard, all Q4_K_M quantized for fair play — slashing VRAM by 75% without gutting smarts:

Rank  Model               Size   Accuracy
1     Qwen3.5 4B          3.4GB  97.5%
2     GLM-4.7-Flash       18GB   95.0%
2     Nemotron 3 Nano 4B  4.2GB  95.0%
4     Mistral Nemo 12B    7.5GB  92.5%
5     Qwen3 8B            5GB    85.0%

Size? Useless predictor. Look at Mistral Small 3.2’s 15GB flopping at 42.5% — 55 points behind the champ.
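
Where does that 75% figure come from? Q4_K_M stores roughly 4.5 bits per weight versus 16 for FP16 — a back-of-envelope sketch (the bits-per-weight value is approximate, and this ignores KV cache and runtime overhead):

def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    # Weight storage in GB: parameter count * bits per weight / 8 bits per byte.
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

fp16 = weight_gb(4, 16.0)   # ~8.0 GB for a 4B model at full FP16
q4km = weight_gb(4, 4.5)    # ~2.2 GB at Q4_K_M's ~4.5 bits/weight
print(f"{fp16:.1f} GB -> {q4km:.1f} GB ({1 - q4km / fp16:.0%} smaller)")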

Why Are Small Models Dominating Function Calling Benchmarks?

Function calling isn’t chit-chat. It’s JSON surgery: exact schemas, no hallucinations, zero tolerance for fake functions. Big models shine on world knowledge, epic reasoning chains — but here? Formatting obedience rules.
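
Concretely, each test case hands the model a schema and accepts exactly one shape of answer. A hypothetical example (not one of Hodges’ actual cases):

# Illustrative tool schema — the function name and fields are hypothetical.
schema = {
    "name": "get_weather",
    "description": "Get the current weather for a city",
    "parameters": {
        "type": "object",
        "properties": {"location": {"type": "string"}},
        "required": ["location"],
    },
}

# For "What's the weather in Tokyo?" the only passing output is exact JSON:
expected = {"name": "get_weather", "arguments": {"location": "Tokyo"}}
# Prose wrapped around the JSON, an invented function name, or a missing
# required field all count as failures.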

Knowledge dependency? Low. It’s all instruction tuning and data quality. Qwen3.5’s secret sauce — my bet — lies in hyper-optimized tuning for structured outputs. Fewer params mean cleaner signals, less noise drowning the good stuff. (Remember AlphaGo? Specialized beats bloated generalists every time.)

And those laggards like xLAM-2 (15%)? LM Studio template glitches, not model failures — though that’s arguably the point: this test measures ‘real-world deployability’ on consumer rigs, template warts and all.

Brutal truth: Parameter count ≠ task performance. Especially for agents, RAG pipelines, local LLMs where VRAM is king.

Plot size against accuracy and you get scattershot — no trend line. The 3.4GB model leads, the 25GB model sits mid-pack, and a 15GB model is the disaster.

Does This Kill the ‘Bigger is Better’ Myth for Good?

Not entirely — free-text generation still favors giants. But tool use? Edge computing’s playground now. Think laptops, phones, IoT. Your 8GB GPU laughs at 3.4GB loads.

Unique angle: This echoes the 2010s mobile chip wars, when ARM’s lean designs crushed x86 power hogs on efficiency. Prediction — by 2027, 80% of agentic apps run sub-10B models quantized like this. Open-source floods the market; closed giants like GPT scramble.

Hodges’ test — 40 cases, one unified environment — exposes PR spin. ‘Scaling laws’ don’t hold here. Corporate hype around 100B+ models? Cute, but irrelevant for practical tooling.

Practical picks:

Purpose        Model               VRAM   Accuracy  Spare VRAM (8GB GPU)
Max Precision  Qwen3.5 4B          3.4GB  97.5%     4.6GB
Balanced       Nemotron 3 Nano 4B  4.2GB  95.0%     3.8GB
Multi-Turn     Mistral Nemo 12B    7.5GB  92.5%     0.5GB
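
If you script the choice, the table collapses to filter-then-max — a toy sketch over these numbers, assuming an 8GB card:

# (model, VRAM in GB, accuracy %) from the table above
PICKS = [("Qwen3.5 4B", 3.4, 97.5),
         ("Nemotron 3 Nano 4B", 4.2, 95.0),
         ("Mistral Nemo 12B", 7.5, 92.5)]

def best_fit(budget_gb: float) -> str:
    # Most accurate model whose weights fit the VRAM budget.
    fits = [(acc, name) for name, size, acc in PICKS if size <= budget_gb]
    return max(fits)[1] if fits else "nothing fits"

print(best_fit(8.0))   # Qwen3.5 4B
print(best_fit(4.0))   # Qwen3.5 4B — the champ fits even a 4GB card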

How to Deploy Qwen3.5 4B for Tools Today

llama.cpp makes it dead simple. GBNF grammars force JSON — no parse fails (llama.cpp even ships a ready-made json.gbnf under grammars/). Below is a minimal sketch; the prompt format and tool schema are illustrative, not Hodges’ exact harness:

import subprocess
import json

# Minimal tool schema — illustrative; swap in your own function definitions.
TOOLS = [{"name": "get_weather",
          "description": "Get the current weather for a city",
          "parameters": {"type": "object",
                         "properties": {"location": {"type": "string"}},
                         "required": ["location"]}}]

def call_with_tools(user_message: str, tools: list) -> dict:
    # Show the model its tools and demand a single JSON tool call.
    prompt = (f"Available tools:\n{json.dumps(tools)}\n\n"
              f"User: {user_message}\n"
              'Reply with one JSON object: {"name": ..., "arguments": ...}\n')
    result = subprocess.run([
        "llama-cli", "-m", "qwen3.5-4b-q4_k_m.gguf",
        "-p", prompt, "-n", "200", "--temp", "0.1",
        "--grammar-file", "json.gbnf",   # constrain decoding to valid JSON
        "--no-display-prompt"            # emit only the generated text
    ], capture_output=True, text=True)
    # The grammar guarantees the output parses as JSON.
    return json.loads(result.stdout.strip())

Without GBNF? Garbage like ‘{"name": "get_weather"} Let me check…’ Parse error city.
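
No grammar support in your stack — say, a hosted endpoint? The usual fallback is defensive extraction; a rough sketch, not a bulletproof parser:

import json
import re

def extract_json(text: str) -> dict:
    # Fish the JSON object out of surrounding chatter: first '{' to last '}'.
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object in model output")
    return json.loads(match.group(0))

print(extract_json('Let me check… {"name": "get_weather", "arguments": {"location": "Tokyo"}}'))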

With it? Pure JSON. Tokyo weather query spits {"name": "get_weather", "arguments": {"location": "Tokyo"}}.
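
End to end, using call_with_tools from above and a hypothetical local stub in place of a real weather API:

def get_weather(location: str) -> dict:
    # Stub handler; swap in a real API call.
    return {"location": location, "temp_c": 21}

call = call_with_tools("What's the weather in Tokyo?", TOOLS)
handlers = {"get_weather": get_weather}
print(handlers[call["name"]](**call["arguments"]))
# -> {'location': 'Tokyo', 'temp_c': 21}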

Market shift incoming. Devs ditch cloud costs — local agents everywhere. Qwen’s win? Alibaba’s open-source play paying off, putting pressure on Mistral and on Nvidia’s CUDA lock-in.

But caveats: n=13 models, a single LM Studio harness, and BFCL leaderboards (which test unquantized models) rank things differently. Still — the deployment signal screams loud.

Here’s the thing: If you’re building agents, benchmark your stack. Don’t chase GBs.



Frequently Asked Questions

What is the best small model for LLM function calling? Qwen3.5 4B Q4_K_M, at 97.5% accuracy on this test — it fits comfortably in 8GB of VRAM.

Why do small LLMs beat large ones in tool use? Tool calling rewards format compliance over knowledge, and instruction tuning shines in compact models.

How do you run function calling locally with llama.cpp? Use GBNF grammars to enforce JSON — see the code above for the Qwen3.5 setup.

Written by Elena Vasquez

Senior editor and generalist covering the biggest stories with a sharp, skeptical eye.



Originally reported by Dev.to
