
Small LLMs Beat Big Ones in Function Calling Benchmarks

Forget parameter counts. A puny 3.4GB model just schooled the heavyweights in function calling. Here's the data shaking up LLM deployment.


Key Takeaways

  • 3.4GB Qwen3.5 4B tops function calling at 97.5%, beating 25GB models.
  • Size doesn't predict accuracy; tuning for structured output wins.
  • Local deployment on 8GB GPUs now viable for agents — use llama.cpp + GBNF.

Ever wonder why your massive 70B model chokes on simple tool calls — while a featherweight 4B breezes through?

In a fresh 2026 benchmark by JD Hodges, which ran 13 quantized LLMs through LM Studio, function calling accuracy flipped every assumption. Qwen3.5 4B, clocking in at just 3.4GB of VRAM, hit 97.5% — nailing 39 out of 40 cases. That’s RTX 4060 territory, folks, with room for embeddings on the side.

A 25GB Nemotron 3 Nano 30B-A3B? 85%. Ouch.


Here’s the full leaderboard, all Q4_K_M quantized for fair play — slashing VRAM by 75% without gutting smarts:

Rank  Model               Size   Accuracy
1     Qwen3.5 4B          3.4GB  97.5%
2     GLM-4.7-Flash       18GB   95.0%
2     Nemotron 3 Nano 4B  4.2GB  95.0%
4     Mistral Nemo 12B    7.5GB  92.5%
5     Qwen3 8B            5GB    85.0%

Size? Useless predictor. Look at Mistral Small 3.2’s 15GB flopping at 42.5% — 55 points behind the champ.
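
Where does that 75% figure come from? Q4_K_M stores roughly 4.5 bits per weight versus 16 for FP16 — a back-of-envelope sketch (the bits-per-weight value is approximate, and this ignores KV cache and runtime overhead):

def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    # Weight storage in GB: parameter count * bits per weight / 8 bits per byte.
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

fp16 = weight_gb(4, 16.0)   # ~8.0 GB for a 4B model at full FP16
q4km = weight_gb(4, 4.5)    # ~2.2 GB at Q4_K_M's ~4.5 bits/weight
print(f"{fp16:.1f} GB -> {q4km:.1f} GB ({1 - q4km / fp16:.0%} smaller)")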

Why Are Small Models Dominating Function Calling Benchmarks?

Function calling isn’t chit-chat. It’s JSON surgery: exact schemas, no hallucinations, zero tolerance for fake functions. Big models shine on world knowledge, epic reasoning chains — but here? Formatting obedience rules.
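
Concretely, each test case hands the model a schema and accepts exactly one shape of answer. A hypothetical example (not one of Hodges’ actual cases):

# Illustrative tool schema — the function name and fields are hypothetical.
schema = {
    "name": "get_weather",
    "description": "Get the current weather for a city",
    "parameters": {
        "type": "object",
        "properties": {"location": {"type": "string"}},
        "required": ["location"],
    },
}

# For "What's the weather in Tokyo?" the only passing output is exact JSON:
expected = {"name": "get_weather", "arguments": {"location": "Tokyo"}}
# Prose wrapped around the JSON, an invented function name, or a missing
# required field all count as failures.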

Knowledge dependency? Low. It’s all instruction tuning and data quality. Qwen3.5’s secret sauce — my bet — lies in hyper-optimized tuning for structured outputs. Fewer params mean cleaner signals, less noise drowning the good stuff. (Remember AlphaGo? Specialized beats bloated generalists every time.)

And those laggards like xLAM-2 (15%)? LM Studio template glitches, not model failures — though that’s arguably the point: this test measures ‘real-world deployability’ on consumer rigs, template warts and all.

Brutal truth: Parameter count ≠ task performance. Especially for agents, RAG pipelines, local LLMs where VRAM is king.

Plot size against accuracy and you get scattershot — no trend line. The 3.4GB model leads, the 25GB model sits mid-pack, and a 15GB model is the disaster.

Does This Kill the ‘Bigger is Better’ Myth for Good?

Not entirely — free-text generation still favors giants. But tool use? Edge computing’s playground now. Think laptops, phones, IoT. Your 8GB GPU laughs at 3.4GB loads.

Unique angle: This echoes the 2010s mobile chip wars, when ARM’s lean designs crushed x86 power hogs on efficiency. Prediction — by 2027, 80% of agentic apps run sub-10B models quantized like this. Open-source floods the market; closed giants like GPT scramble.

Hodges’ test — 40 cases, one unified environment — exposes PR spin. ‘Scaling laws’ don’t hold here. Corporate hype around 100B+ models? Cute, but irrelevant for practical tooling.

Practical picks:

Purpose        Model               VRAM   Accuracy  Spare VRAM (8GB GPU)
Max Precision  Qwen3.5 4B          3.4GB  97.5%     4.6GB
Balanced       Nemotron 3 Nano 4B  4.2GB  95.0%     3.8GB
Multi-Turn     Mistral Nemo 12B    7.5GB  92.5%     0.5GB
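
If you script the choice, the table collapses to filter-then-max — a toy sketch over these numbers, assuming an 8GB card:

# (model, VRAM in GB, accuracy %) from the table above
PICKS = [("Qwen3.5 4B", 3.4, 97.5),
         ("Nemotron 3 Nano 4B", 4.2, 95.0),
         ("Mistral Nemo 12B", 7.5, 92.5)]

def best_fit(budget_gb: float) -> str:
    # Most accurate model whose weights fit the VRAM budget.
    fits = [(acc, name) for name, size, acc in PICKS if size <= budget_gb]
    return max(fits)[1] if fits else "nothing fits"

print(best_fit(8.0))   # Qwen3.5 4B
print(best_fit(4.0))   # Qwen3.5 4B — the champ fits even a 4GB card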

How to Deploy Qwen3.5 4B for Tools Today

llama.cpp makes it dead simple. GBNF grammars force JSON — no parse fails (llama.cpp even ships a ready-made json.gbnf under grammars/). Below is a minimal sketch; the prompt format and tool schema are illustrative, not Hodges’ exact harness:

import subprocess
import json

# Minimal tool schema — illustrative; swap in your own function definitions.
TOOLS = [{"name": "get_weather",
          "description": "Get the current weather for a city",
          "parameters": {"type": "object",
                         "properties": {"location": {"type": "string"}},
                         "required": ["location"]}}]

def call_with_tools(user_message: str, tools: list) -> dict:
    # Show the model its tools and demand a single JSON tool call.
    prompt = (f"Available tools:\n{json.dumps(tools)}\n\n"
              f"User: {user_message}\n"
              'Reply with one JSON object: {"name": ..., "arguments": ...}\n')
    result = subprocess.run([
        "llama-cli", "-m", "qwen3.5-4b-q4_k_m.gguf",
        "-p", prompt, "-n", "200", "--temp", "0.1",
        "--grammar-file", "json.gbnf",   # constrain decoding to valid JSON
        "--no-display-prompt"            # emit only the generated text
    ], capture_output=True, text=True)
    # The grammar guarantees the output parses as JSON.
    return json.loads(result.stdout.strip())

Without GBNF? Garbage like ‘{"name": "get_weather"} Let me check…’ Parse error city.
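
No grammar support in your stack — say, a hosted endpoint? The usual fallback is defensive extraction; a rough sketch, not a bulletproof parser:

import json
import re

def extract_json(text: str) -> dict:
    # Fish the JSON object out of surrounding chatter: first '{' to last '}'.
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object in model output")
    return json.loads(match.group(0))

print(extract_json('Let me check… {"name": "get_weather", "arguments": {"location": "Tokyo"}}'))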

With it? Pure JSON. Tokyo weather query spits {"name": "get_weather", "arguments": {"location": "Tokyo"}}.
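
End to end, using call_with_tools from above and a hypothetical local stub in place of a real weather API:

def get_weather(location: str) -> dict:
    # Stub handler; swap in a real API call.
    return {"location": location, "temp_c": 21}

call = call_with_tools("What's the weather in Tokyo?", TOOLS)
handlers = {"get_weather": get_weather}
print(handlers[call["name"]](**call["arguments"]))
# -> {'location': 'Tokyo', 'temp_c': 21}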

Market shift incoming. Devs ditch cloud costs — local agents everywhere. Qwen’s win? Alibaba’s open-source play paying off, putting pressure on Mistral and on Nvidia’s CUDA lock-in.

But caveats: n=13 models, a single LM Studio harness, and BFCL leaderboards (which test unquantized models) rank things differently. Still — the deployment signal screams loud.

Here’s the thing: If you’re building agents, benchmark your stack. Don’t chase GBs.



Frequently Asked Questions

What is the best small model for LLM function calling? Qwen3.5 4B Q4_K_M, at 97.5% accuracy on this test — it fits comfortably in 8GB of VRAM.

Why do small LLMs beat large ones in tool use? Tool calling rewards format compliance over knowledge, and instruction tuning shines in compact models.

How do you run function calling locally with llama.cpp? Use GBNF grammars to enforce JSON — see the code above for the Qwen3.5 setup.

Written by Elena Vasquez

Senior editor and generalist covering the biggest stories with a sharp, skeptical eye.



Originally reported by Dev.to
