Large Language Models

PyTorch 2.10 + TorchAO on Intel Core Ultra 3

58 milliseconds to spit out the first token from a Qwen model. Intel claims its Core Ultra Series 3, juiced by PyTorch 2.10 and TorchAO, is ready for prime-time AI on your laptop — but let's poke holes in the hype.

Intel Core Ultra 3 Delivers 58ms LLM Tokens via PyTorch 2.10 — But Is It Enough?

Key Takeaways

  • Core Ultra Series 3 achieves sub-300ms first-token LLM latencies with PyTorch 2.10 and TorchAO quantization.
  • Unified PyTorch XPU backend supports Hugging Face models out-of-the-box, challenging CUDA dominance.
  • Unique edge: Full system memory access enables larger contexts, but power and ecosystem lag Nvidia.

58 milliseconds. That’s the first-token latency for Qwen3-0.6B on Intel’s Core Ultra Series 3 processors, running PyTorch 2.10 with TorchAO quantization. Blazing? For a laptop chip, maybe. But hold the applause.

Intel’s latest push into AI PCs reeks of catch-up fever. They’ve crammed in Xe3 architecture, up to 12 Xe-cores, 96 XMX engines promising 120 TOPS, and even 96GB LPDDR5x-9600 memory. Sounds beefy. Except Nvidia’s discrete GPUs laugh from the server racks, and Apple’s M-series sips power while crushing similar workloads.

Here’s the pitch: PyTorch 2.10 now plays nice with Intel’s XPU backend. No more begging for CUDA scraps. TorchAO handles quantization — int4 weights, anyone? — to squeeze big LLMs onto edge devices. And it’s “out-of-the-box,” they say. Install, quantize, infer. Easy as pie, if your pie’s baked on Ubuntu or Windows with the right drivers.
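
For intuition, here’s what TorchAO’s quantize_ API looks like on a bare nn.Module. A minimal sketch with made-up layer sizes, assuming your TorchAO build supports int4 weight-only quantization on the device you run it on:

import torch
import torch.nn as nn
from torchao.quantization import quantize_, Int4WeightOnlyConfig

# Toy stand-in for the Linear-heavy blocks of an LLM (sizes are illustrative)
model = nn.Sequential(nn.Linear(4096, 4096), nn.GELU(), nn.Linear(4096, 4096))
model = model.to(dtype=torch.bfloat16)

# Rewrites the Linear weights to packed int4 in place;
# group_size trades accuracy against memory footprint
quantize_(model, Int4WeightOnlyConfig(group_size=128))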

Why Intel’s Suddenly Obsessed with PyTorch

Look. Intel’s been the reliable CPU workhorse forever. But AI? They’ve been lapped. Remember the Habana Gaudi flop? Or the oneAPI dreams that went nowhere? Now, with Core Ultra Series 3 — Lunar Lake, if you’re into codenames — they’re betting on integrated GPUs for local AI. Private. Personalized. No cloud phoning home.

The table they flaunt? Impressive latencies across Qwen models, Phi-4 minis, even Llama 3.2-3B. 14.84ms for subsequent tokens on the tiny Qwen. Not bad for a mobile chip. But — and it’s a big but — these are quantized models: int4 weights with 16-bit activations (the “int4a16” recipe), run through torch.compile. Real-world? Your 70B behemoth won’t fit without serious compromises.
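
Those latencies assume torch.compile is in the loop, so kernels get fused before anything is measured. A minimal sketch of that setup on a toy module (the shapes and CPU fallback are illustrative):

import torch
import torch.nn as nn

# Fall back gracefully when no XPU is present
device = "xpu" if torch.xpu.is_available() else "cpu"

model = nn.Sequential(nn.Linear(512, 512), nn.GELU(), nn.Linear(512, 512)).to(device)
model = torch.compile(model)  # first call compiles; subsequent calls hit the fused kernels

x = torch.randn(8, 512, device=device)
print(model(x).shape)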

“The combination of dense matrix multiplication capabilities in the GPU with access to full system memory bandwidth gives Intel® Core™ Ultra Series 3 processors unique capabilities in the segment to run larger models and larger contexts.”

Unique? Sure, if “unique” means finally competitive. That’s straight from Intel’s overview. Corporate spin at its finest — implying no one else shares system memory. Spoiler: They do.

PyTorch 2.10’s XPU support unifies the experience. Same API as CUDA. Hugging Face Transformers? Diffusers? LeRobot? All work. Data types from int4 to float32. Bottlenecks like Linear layers and SDPA get Intel’s oneAPI tweaks. Developers rejoice? Maybe. If you’re tired of Nvidia’s moat.
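
The “same API” pitch is easy to picture: anywhere you’d write "cuda", you write "xpu". A quick sketch using SDPA, one of the ops Intel says it tuned (the tensor shapes are arbitrary):

import torch
import torch.nn.functional as F

device = "xpu" if torch.xpu.is_available() else "cpu"

# Shapes are (batch, heads, seq_len, head_dim); the same call
# dispatches to CUDA, XPU, or CPU kernels depending on the device
q = torch.randn(1, 8, 128, 64, device=device)
k = torch.randn(1, 8, 128, 64, device=device)
v = torch.randn(1, 8, 128, 64, device=device)

out = F.scaled_dot_product_attention(q, k, v)
print(out.shape, out.device)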

Can PyTorch 2.10 Make Intel XPUs a Dev Darling?

Short answer: Probably not yet. But it’s a step. Here’s the quick Llama 3.1-8B example they provide — a few lines to quantize and load on XPU.

import torch
from transformers import AutoModelForCausalLM, TorchAoConfig
from torchao.quantization import Int4WeightOnlyConfig

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"

# int4 weight-only quantization; the plain_int32 packing format targets the XMX engines
quant_config = Int4WeightOnlyConfig(group_size=128, int4_packing_format="plain_int32")
quantization_config = TorchAoConfig(quant_config)

# device_map="xpu" places the quantized model directly on the integrated GPU
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="xpu",
    torch_dtype=torch.float16,
    quantization_config=quantization_config,
)
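
From there it’s the standard Transformers flow. A minimal sketch of actually generating with the quantized model (the prompt and token budget are arbitrary):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_id)
inputs = tokenizer("Explain int4 quantization in one sentence.", return_tensors="pt").to("xpu")

# Greedy decoding is enough for a smoke test
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))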

Pip installs? Straightforward: pip install torch==2.10.0 --index-url https://download.pytorch.org/whl/xpu plus TorchAO nightly. Drivers first, though. Windows from Intel’s site; Ubuntu via their guide. Miss that, and you’re back to CPU purgatory.
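
Before blaming the model, sanity-check the stack. If either the driver or the XPU wheel is missing, this prints False and you’re silently on CPU:

import torch

# True only when both the Intel GPU driver and the XPU-enabled wheel are installed
print(torch.xpu.is_available())
if torch.xpu.is_available():
    print(torch.xpu.get_device_name(0))  # the integrated Xe GPU on Core Ultra parts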

My unique hot take? This echoes Intel’s Pentium era pivot — when they crushed AMD by optimizing for their own silicon. TorchAO could be that killer app for XPUs, forcing devs to rethink CUDA dependency. Bold prediction: By 2026, 30% of edge AI inference shifts here, if Intel subsidizes laptops hard. Otherwise, it’s vaporware.

But skepticism reigns. 120 TOPS sounds hot — until Qualcomm’s Snapdragon X Elite claims 45 TOPS NPU alone, and Apple’s A18 Pro hits neural engine peaks we can’t even verify. Intel’s full-system bandwidth brag? LPDDR5x helps, but power walls loom on thin-and-lights.

Performance table deep-dive:

  • Qwen3-4B: 276ms first token, 33.54ms per token after.
  • Phi-4-mini-instruct: 293ms cold, 32ms steady.
  • Llama 3.2-3B: 242ms first token, 27ms after.

DeepSeek is only partial in the snippet, but you get it — sub-300ms cold starts for 3-4B models. Viable for chatbots. Not for real-time video gen.

Edge cases? SYCL extensibility in PyTorch 2.10 lets you hack custom kernels. Faster dev cycles promised. But who’s got time for that when vLLM on CUDA just works?

Is This the AI PC Revolution Intel Promised?

Nah. It’s a toolkit upgrade. AI PCs need apps, not just inference speed. Copilot+? Recall? Those flopped on launch. Intel’s Arrow Lake desktop follow-up might pack more punch, but Series 3 is mobile-first.

Dry humor aside — imagine your Ultrabook as an AI sidekick. Private Llama chats without subscription. Quantized Phi for code tweaks on the go. If TorchAO delivers consistent 2x speedups over stock PyTorch, devs might bite.
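
Want to audit a speedup claim like that yourself? Since Linear layers dominate LLM inference, a crude harness on one big Linear gives a first read. A sketch, with explicit synchronization so you time kernels rather than queue submission, assuming int4 weight-only works on your device (sizes and iteration counts are arbitrary):

import time
import torch
import torch.nn as nn
from torchao.quantization import quantize_, Int4WeightOnlyConfig

device = "xpu" if torch.xpu.is_available() else "cpu"

@torch.no_grad()
def bench(model, x, iters=50):
    # Warm up, then time; synchronize so async XPU kernels actually finish
    for _ in range(5):
        model(x)
    if device == "xpu":
        torch.xpu.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        model(x)
    if device == "xpu":
        torch.xpu.synchronize()
    return (time.perf_counter() - start) / iters

# One big Linear as a rough proxy for an LLM block
model = nn.Sequential(nn.Linear(4096, 4096, bias=False)).to(device=device, dtype=torch.bfloat16)
x = torch.randn(1, 4096, device=device, dtype=torch.bfloat16)
baseline = bench(model, x)

quantize_(model, Int4WeightOnlyConfig(group_size=128))
print(f"bf16: {baseline * 1e3:.2f} ms/iter, int4: {bench(model, x) * 1e3:.2f} ms/iter")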

Critique the PR: “Unlock a wider range of AI scenarios.” Vague much? It’s LLM inference, quantized. No vision, no multimodal yet. And that 96GB memory? Per system, shared. Not all configs hit it.

Wander a bit: Back in 2010, Intel hyped Larrabee as a GPU killer. Vaporized. History rhymes — will Core Ultra fade? Or force Nvidia to care about consumers?

Devs, try it. Worst case, you learn XPU quirks.

Installation snags aside — driver hell on Linux persists — the ecosystem breadth impresses. Hugging Face integration means porting Stable Diffusion or voice models takes minutes, not weeks. And TorchAO’s int4 packing (the plain_int32 format shines on XMX engines) shrinks memory footprints roughly 4x, enabling 8B models on 16GB RAM laptops, which — let’s face it — is most of the market. That pushes AI from datacenter fantasy to backpack reality, though the accuracy drop demands fine-tuning rituals most skip.
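
The 4x figure is just arithmetic. A back-of-envelope check, assuming one fp16 scale per group of 128 weights (real overhead varies with format and metadata):

# Why int4 lets an 8B model fit where fp16 cannot
params = 8e9
fp16_gb = params * 2 / 1e9            # 16.0 GB of weights alone
int4_gb = params * 0.5 / 1e9          # 4.0 GB: four bits per weight
scales_gb = (params / 128) * 2 / 1e9  # ~0.125 GB of per-group fp16 scales
print(fp16_gb, int4_gb + scales_gb)   # 16.0 vs ~4.125 GB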

Still, competition heats up.



Frequently Asked Questions

What is TorchAO and how does it work with PyTorch on Intel? TorchAO is PyTorch’s native quantization and optimization library; its int4/int8 weight configs let big LLMs run on Intel XPUs. Pair it with PyTorch 2.10 and the Transformers TorchAoConfig shown above.

How do you install PyTorch 2.10 for Intel Core Ultra Series 3? Grab GPU drivers first, then pip install torch==2.10.0 --index-url https://download.pytorch.org/whl/xpu and TorchAO nightly. Works on Windows/Ubuntu.

Does PyTorch 2.10 support Llama models on Intel XPUs? Yes, load Llama 3.1-8B with int4 quantization via Transformers and TorchAoConfig — hits ~240ms first token on 3B variants.

Written by Sarah Chen

AI research editor covering LLMs, benchmarks, and the race between frontier labs. Previously at MIT CSAIL.



Originally reported by PyTorch Blog
