March 3, 2026

We Built the Fastest LLM Decode Engine for Apple Silicon. Here Are the Numbers.

We set out to answer a simple question: how fast can you run an LLM on Apple Silicon if you throw away every abstraction and go straight to the metal?

The answer: 658 tokens per second on Qwen3-0.6B, 570 tok/s on a hybrid conv+attention architecture, and up to 26x faster multi-turn conversations — all on a single M4 Max with zero framework overhead.

And that's not just one model. We tested MetalRT across four different models against three established engines — and MetalRT won 3 out of 4 decode benchmarks, averaging 1.67x faster than llama.cpp on directly comparable models.

Today we're sharing the full results from MetalRT, our bare-metal LLM inference engine for Apple Silicon, built by the RunAnywhere AI team. We tested it head-to-head against:

  • uzu — a production-grade Rust inference engine by Mirai Tech
  • mlx-lm — Apple's own MLX inference framework
  • llama.cpp — the most widely-used open-source inference engine

The results speak for themselves.

Decode Speed — MetalRT Wins 3 of 4 Models

Decode speed determines how fast tokens stream to the user. It's the metric that matters most for interactive chat. MetalRT took first place on 3 out of 4 models.

Decode throughput comparison — all 4 engines across all models (tok/s):

Model          MetalRT   uzu    mlx-lm   llama.cpp
Qwen3-0.6B     658       627    552      295*
Qwen3-4B       186       165    170      87
Llama-3.2-3B   184       222    210      137
LFM2.5-1.2B    570       550    509      372

*Qwen3-0.6B llama.cpp uses Q8_0 (8-bit) — not directly comparable with 4-bit engines.

MetalRT leads decode throughput on 3 of 4 models — from the smallest 0.6B to a 4B transformer to a hybrid conv+attention architecture. The speedups are significant:

  • 1.05–1.12x vs uzu on the models MetalRT wins
  • 1.10–1.19x vs mlx-lm using the exact same model files
  • 1.35–2.23x vs llama.cpp across the board

uzu takes Llama-3.2-3B with 222 tok/s — a genuine win. With proper thermal cooldowns preventing measurement bias, uzu's Llama-optimized runtime outperforms our generic pipeline on that specific model. We report this honestly.

MetalRT vs llama.cpp

llama.cpp is the baseline everyone knows. Here's how MetalRT stacks up:

MetalRT decode throughput advantage over llama.cpp (tok/s):

Model          MetalRT   llama.cpp   Speedup
Qwen3-0.6B     658       295         2.23x (different quant)
Qwen3-4B      186       87          2.14x
Llama-3.2-3B   184       137         1.35x
LFM2.5-1.2B    570       372         1.53x

Average speedup on Q4 models: 1.67x. That's a 67% throughput improvement over the most popular inference engine in the world, on the same hardware.
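The 1.67x headline figure follows directly from the table above. A quick check (per-model ratios are computed here from the rounded tok/s values, so the last digit can differ slightly from the speedup column):

```python
# Reproduce the "average speedup on Q4 models" from the decode table.
# Qwen3-0.6B is excluded: llama.cpp ran it at Q8_0, not comparable with Q4.
q4_speedups = {
    "Qwen3-4B": 186 / 87,
    "Llama-3.2-3B": 184 / 137,
    "LFM2.5-1.2B": 570 / 372,
}
average = sum(q4_speedups.values()) / len(q4_speedups)
print(f"{average:.2f}x")  # → 1.67x
```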

Multi-Turn Conversations — Where MetalRT Truly Dominates

This is where MetalRT's architecture really shines. We ran a 15-turn conversation on each model and measured total time-to-first-token as the context grows.

Most engines get progressively slower with each turn as the conversation grows. MetalRT doesn't — it stays nearly flat regardless of conversation length.
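A flat curve is exactly what you would expect from persisting the KV cache across turns: with a cache, each turn only prefills the newly added message; without one, the engine re-prefills the whole transcript. A minimal cost model illustrates the gap (the 15 turns × 100 tokens sizing here is a hypothetical example, not a measurement):

```python
def total_prefill_tokens(turns: int, tokens_per_turn: int, reuse_cache: bool) -> int:
    """Total tokens prefilled across a conversation.

    With a persistent KV cache only the new turn is processed;
    without one, the growing transcript is reprocessed every turn.
    """
    total = 0
    context = 0
    for _ in range(turns):
        context += tokens_per_turn
        total += tokens_per_turn if reuse_cache else context
    return total

# Hypothetical conversation: 15 turns, 100 tokens added per turn.
cached = total_prefill_tokens(15, 100, reuse_cache=True)     # 1,500 tokens
uncached = total_prefill_tokens(15, 100, reuse_cache=False)  # 12,000 tokens
print(uncached / cached)  # → 8.0
```

The re-prefill cost grows quadratically with conversation length, which is why the gap widens with every turn.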

Per-turn TTFT on Llama-3.2-3B — MetalRT stays flat while other engines scale linearly

On Llama-3.2-3B, by turn 15 mlx-lm takes 661ms per turn while MetalRT takes just 14ms. That's a 26x total TTFT improvement across the full conversation.

Multi-turn conversation speedup — MetalRT vs mlx-lm total TTFT across 15 turns:

Model          MetalRT    mlx-lm     Speedup
Llama-3.2-3B   171 ms     4,451 ms   26.0x
Qwen3-4B       1,393 ms   6,956 ms   5.0x
Qwen3-0.6B     185 ms     875 ms     4.7x
LFM2.5-1.2B    455 ms     1,536 ms   3.4x

While other engines get progressively slower with each turn, MetalRT stays nearly flat.
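The speedup column is just the ratio of the two totals; recomputing it from the table:

```python
# (MetalRT, mlx-lm) total TTFT in ms over 15 turns, from the table above.
totals_ms = {
    "Llama-3.2-3B": (171, 4451),
    "Qwen3-4B": (1393, 6956),
    "Qwen3-0.6B": (185, 875),
    "LFM2.5-1.2B": (455, 1536),
}
for model, (metalrt, mlx) in totals_ms.items():
    print(f"{model}: {mlx / metalrt:.1f}x")
# Llama-3.2-3B: 26.0x, Qwen3-4B: 5.0x, Qwen3-0.6B: 4.7x, LFM2.5-1.2B: 3.4x
```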

Per-Turn Breakdown: All Models

Per-turn TTFT across all four models

Turn-by-Turn: Qwen3-0.6B

Turn   MetalRT   mlx-lm
1      5.8 ms    13.9 ms
5      12.4 ms   37.1 ms
10     12.3 ms   71.7 ms
15     17.0 ms   107.1 ms

By turn 15, mlx-lm takes 107ms to start responding. MetalRT takes 17ms. Other engines scale linearly with conversation length; MetalRT's TTFT stays nearly flat — only increasing from 5.8ms to 17ms across 15 turns.
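"Linear vs flat" can be put in numbers by taking the average growth per turn between the endpoints in the table above:

```python
# Average TTFT growth per turn between turn 1 and turn 15 (14 turn gaps).
def ms_per_turn(t1_ms: float, t15_ms: float, gaps: int = 14) -> float:
    return (t15_ms - t1_ms) / gaps

print(f"mlx-lm:  {ms_per_turn(13.9, 107.1):.2f} ms/turn")  # → 6.66
print(f"MetalRT: {ms_per_turn(5.8, 17.0):.2f} ms/turn")    # → 0.80
```

mlx-lm's TTFT climbs more than eight times faster per turn than MetalRT's.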

Time-to-First-Token — Near-Instant Responsiveness

TTFT is how long the user waits before seeing the first token. MetalRT keeps TTFT at or below 55ms on every model tested and wins on Qwen3-4B — the largest transformer in the benchmark.

Model          MetalRT   uzu       mlx-lm    Winner
Qwen3-0.6B     6.6 ms    6.2 ms    12.4 ms   uzu
Qwen3-4B       41.9 ms   51.3 ms   43.6 ms   MetalRT
Llama-3.2-3B   55.0 ms   43.9 ms   39.9 ms   mlx-lm
LFM2.5-1.2B    13.2 ms   12.3 ms   12.8 ms   uzu

Single-turn TTFT is competitive across all three native engines, with the win split three ways. Where MetalRT truly shines is multi-turn, where it keeps TTFT near-constant regardless of conversation length.

Prefill Speed

Prefill speed measures how fast the engine processes the input prompt before generating the first token. This matters for long-context scenarios like document summarization and RAG pipelines.

Prefill throughput (tok/s):

Model          MetalRT   uzu     mlx-lm   llama.cpp
Qwen3-0.6B     2,732     3,066   1,454    1,793
Qwen3-4B       430       371     413      453
Llama-3.2-3B   363       1,047   1,153    656
LFM2.5-1.2B    1,434     1,632   1,564    1,613

Prefill is the one area where MetalRT doesn't lead — uzu takes two of the four prefill wins, with mlx-lm and llama.cpp taking one each.

Output Quality — Identical Across Engines

All engines produce identical-quality output. We verified across all 4 models with both greedy and temperature sampling.

On Qwen3-4B with greedy decoding, all engines produce character-for-character identical output:

"Okay, the user is asking for a short introduction to large language models. Let me start by recalling what I know about LLMs. They are AI models that process and generate human-like text. I should mention their size, like having billions of parameters..."

On Llama-3.2-3B, MetalRT and mlx-lm produce nearly word-for-word identical responses. The slight differences across engines are due to quantization format differences, not engine quality. The model is the same — the output quality is the same.

Where MetalRT Wins

Use Case             Best Engine   Why
Decode throughput    MetalRT       Wins decode on 3 of 4 models
Multi-turn chat      MetalRT       3.4–26x faster multi-turn response times
Prefill throughput   uzu           Takes the most prefill wins (2 of 4)

Built for Real-World On-Device AI

MetalRT's decode speed and multi-turn responsiveness aren't just benchmark numbers — they unlock use cases that are impractical with slower engines:

AI Assistants & Chat Apps — At 186 tok/s on a 4B model, responses stream fast enough to feel instantaneous. Multi-turn conversations stay snappy through turn 15 and beyond, without the progressively slower responses users experience on other engines.

Structured Output & Tool Calling — Generating JSON, function calls, or structured data from an LLM is decode-bound. Faster decode means faster structured output — critical for agent workflows where the LLM output drives the next action.

Agent Loops & Agentic Workflows — Agents make many sequential LLM calls. When each call decodes 1.67x faster and multi-turn context is handled without re-processing, compound latency savings turn slow agent pipelines into real-time ones.

On-Device Coding Assistants — Code completion and inline suggestions are extremely latency-sensitive. Sub-7ms TTFT on small models means suggestions appear as fast as the user can type.

Privacy-First Applications — Banking, healthcare, legal — any domain where data can't leave the device. MetalRT delivers cloud-competitive speed entirely on-device, making private AI practical rather than compromised.

Real-Time Voice Pipelines — In a listen → transcribe → think → speak pipeline, the "think" step is LLM decode. Faster decode shrinks the gap between the user finishing their sentence and hearing the AI respond.
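A rough budget for that "think" step can be estimated from the numbers above, assuming the full reply is decoded before speech begins (the 60-token reply length is a hypothetical example, not a measured workload):

```python
# Rough end-of-utterance-to-speech latency for the LLM step of a voice
# pipeline, from TTFT plus decode throughput.
def think_latency_ms(reply_tokens: int, decode_tok_s: float, ttft_ms: float) -> float:
    return ttft_ms + reply_tokens / decode_tok_s * 1000

# LFM2.5-1.2B on MetalRT: 570 tok/s decode, 13.2 ms TTFT (from the tables above).
print(f"{think_latency_ms(60, 570, 13.2):.0f} ms")  # → 118 ms
```

Streaming TTS that starts on the first sentence would cut perceived latency further; this is the worst case of waiting for the whole reply.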

What This Means

We built MetalRT to prove that on-device LLM inference on Apple Silicon has significant untapped performance:

  • 1.67x faster than llama.cpp on decode (average across Q4 models)
  • 1.10–1.19x faster than mlx-lm on decode (same model files)
  • 3.4–26x faster multi-turn response times
  • 658 tok/s peak decode on Qwen3-0.6B

This isn't a theoretical benchmark. These numbers translate directly to user experience:

  • Tokens that stream at 658/s instead of 295/s (Qwen3-0.6B vs llama.cpp)
  • A 15-turn conversation where every turn responds in 17ms instead of 107ms (Qwen3-0.6B)
  • A mid-range model at 186 tok/s instead of 87 tok/s (Qwen3-4B vs llama.cpp)
  • A hybrid architecture running at 570 tok/s instead of 372 tok/s (LFM2.5 vs llama.cpp)

For developers building on-device AI products on Apple Silicon — chat apps, AI assistants, coding tools, agent frameworks — these gains compound into meaningfully better user experiences.


All benchmarks were run on March 3, 2026 on an Apple M4 Max with 64 GB unified memory running macOS 26.3. Models: Qwen3-0.6B, Qwen3-4B, Llama-3.2-3B, LFM2.5-1.2B — all 4-bit quantized. Greedy decoding (temperature=0) for performance metrics; temperature=0.7 with top_k=50 for quality verification. 5 runs per benchmark, best reported. 5s cooldown between engine runs, 10s between models. Multi-turn: 15-turn conversation, 64 tokens/turn, best of 3 runs. MetalRT + mlx-lm share identical MLX 4-bit model files; uzu uses lalamo-converted files; llama.cpp uses GGUF Q4_K_M (Q8_0 for Qwen3-0.6B).
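The best-of-N-with-cooldown protocol described above can be sketched as follows; `run_decode` is a hypothetical stand-in for invoking an engine, not part of any of the benchmarked tools:

```python
import time

def best_of(n_runs, bench_fn, cooldown_s=5):
    """Run a benchmark n_runs times with a thermal cooldown between runs,
    reporting the best (highest) result, per the protocol above."""
    results = []
    for i in range(n_runs):
        if i > 0:
            time.sleep(cooldown_s)  # let the SoC cool so later runs aren't throttled
        results.append(bench_fn())
    return max(results)

# `run_decode` is a hypothetical callable returning decode throughput in tok/s:
# best_tok_s = best_of(5, run_decode)
```

Reporting the best of several runs measures the engine's capability rather than thermal throttling; the cooldowns keep earlier runs from heating the chip and biasing later ones.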
