March 13, 2026

MetalRT Now Runs Vision Language Models. Fastest on Apple Silicon.


First, we shipped the fastest LLM decode engine for Apple Silicon. Then, the fastest Speech-to-Text and Text-to-Speech.

Today, MetalRT adds vision. The fastest VLM decode engine for Apple Silicon.

We benchmarked Qwen3-VL-2B across four image resolutions against the major open-source engines. MetalRT won every decode configuration.

279 tok/s average vision decode. 92ms time-to-output. 100% win rate.

We tested against:

  • mlx-vlm - Apple's MLX vision-language framework
  • llama.cpp - the most widely-used open-source inference engine

Setup

| Engine | Type | Benchmark Method |
| --- | --- | --- |
| MetalRT | C++ | Native binary |
| mlx-vlm | Python + MLX C++ | Python API |
| llama.cpp | C/C++ | Server API (server-reported) |
  • Hardware: Apple M4 Max, 64GB unified memory, macOS 26.3
  • Model: Qwen3-VL-2B-Instruct (4-bit quantized)
  • Images: Real photograph (COCO val2017) at 224×224, 320×320, 448×448, 448×896
  • Runs: 5 per engine, averaged
  • Fairness: Identical image, identical prompt, greedy decoding (temperature=0) across all engines
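A decode-throughput run of the kind described above can be sketched as a minimal timing harness. Here `generate_tokens` is a hypothetical stand-in for an engine's decode loop, not an API from any of the engines tested, and the toy timings are illustrative only:

```python
import time
from statistics import mean

def benchmark_decode(generate_tokens, runs=5):
    """Average decode throughput (tok/s) over several runs.

    `generate_tokens` is a hypothetical callable standing in for an
    engine's decode loop; it returns the number of tokens produced.
    In the real benchmark, the KV cache is cleared between runs.
    """
    rates = []
    for _ in range(runs):
        start = time.perf_counter()
        n_tokens = generate_tokens()
        elapsed = time.perf_counter() - start
        rates.append(n_tokens / elapsed)
    return mean(rates)

# Toy stand-in: "decodes" 100 tokens in about 0.35 s (~285 tok/s)
def fake_engine():
    time.sleep(0.35)
    return 100

print(f"{benchmark_decode(fake_engine):.0f} tok/s")
```

Throughput is tokens out divided by wall-clock decode time, averaged over the five runs — the same arithmetic regardless of which engine sits behind the callable.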

Vision Decode Throughput

Decode speed is how fast tokens stream to the user after the image is processed. MetalRT won every resolution.

Higher is better

| Image | MetalRT (tok/s) | mlx-vlm (tok/s) | llama.cpp (tok/s) | Winner |
| --- | --- | --- | --- | --- |
| 224×224 | 285.4 | 239.1 | 231.0 | MetalRT |
| 320×320 | 286.6 | 234.8 | 230.2 | MetalRT |
| 448×448 | 279.3 | 226.7 | 226.8 | MetalRT |
| 448×896 | 264.7 | 217.3 | 215.3 | MetalRT |

The speedups:

  • 1.19-1.23x vs mlx-vlm
  • 1.23-1.25x vs llama.cpp

Time-to-Output

Time-to-output measures the full pipeline — vision encoding, prefill, and first tokens out. This is what the user actually feels.

Lower is better

| Image | MetalRT (ms) | mlx-vlm (ms) | llama.cpp (ms) | Winner |
| --- | --- | --- | --- | --- |
| 224×224 | 92.4 | 118.5 | 141.9 | MetalRT |
| 320×320 | 120.3 | 142.8 | 161.5 | MetalRT |
| 448×448 | 171.8 | 204.1 | 219.5 | MetalRT |
| 448×896 | 346.1 | 334.7 | 362.7 | mlx-vlm |

92ms to process an image and start generating. That's less than a tenth of a second.
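Measured end to end, the pipeline described above can be sketched like this. The three stage functions are placeholders with illustrative sleeps — the post reports only the total (92.4ms at 224×224), not a per-stage split:

```python
import time

def measure_time_to_output(encode_image, prefill, decode_first_tokens):
    """Wall-clock time from image submission to first tokens out:
    vision encoding + prefill + first decoded tokens, in milliseconds."""
    start = time.perf_counter()
    encode_image()         # vision encoder over the image patches
    prefill()              # image + prompt tokens through the language model
    decode_first_tokens()  # first tokens stream out
    return (time.perf_counter() - start) * 1000.0

# Placeholder stages; the sleep durations are illustrative only.
tto_ms = measure_time_to_output(
    lambda: time.sleep(0.055),  # vision encode
    lambda: time.sleep(0.030),  # prefill
    lambda: time.sleep(0.007),  # first tokens
)
print(f"time-to-output: {tto_ms:.0f} ms")
```

The point of measuring the full pipeline, rather than decode alone, is that a fast decoder behind a slow vision encoder still feels slow to the user.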

Text Decode Throughput

VLMs also handle text-only prompts. MetalRT wins every context length.

Higher is better

| Context @ Decode (tokens) | MetalRT (tok/s) | mlx-vlm (tok/s) | llama.cpp (tok/s) | Winner |
| --- | --- | --- | --- | --- |
| 128 @ 50 | 290.3 | 244.9 | 230.6 | MetalRT |
| 256 @ 100 | 290.2 | 232.0 | 221.0 | MetalRT |
| 512 @ 100 | 267.6 | 219.8 | 203.4 | MetalRT |
| 1024 @ 100 | 245.2 | 227.8 | 208.0 | MetalRT |
| 2048 @ 100 | 230.5 | 213.8 | 197.9 | MetalRT |

MetalRT vs mlx-vlm and llama.cpp

Decode Speedup

MetalRT is 1.19-1.23x faster than mlx-vlm and 1.23-1.25x faster than llama.cpp on vision decode across all resolutions.
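Those ranges follow directly from the vision decode table; a quick check:

```python
# Vision decode throughput (tok/s) at 224x224, 320x320, 448x448, 448x896,
# taken from the table above.
metalrt  = [285.4, 286.6, 279.3, 264.7]
mlx_vlm  = [239.1, 234.8, 226.7, 217.3]
llamacpp = [231.0, 230.2, 226.8, 215.3]

def speedups(ours, theirs):
    """Per-resolution throughput ratio: ours / theirs."""
    return [m / t for m, t in zip(ours, theirs)]

vs_mlx  = speedups(metalrt, mlx_vlm)
vs_lcpp = speedups(metalrt, llamacpp)

print(f"vs mlx-vlm:   {min(vs_mlx):.2f}x - {max(vs_mlx):.2f}x")    # 1.19x - 1.23x
print(f"vs llama.cpp: {min(vs_lcpp):.2f}x - {max(vs_lcpp):.2f}x")  # 1.23x - 1.25x
print(f"average decode: {sum(metalrt) / len(metalrt):.0f} tok/s")  # 279 tok/s
```

The same arithmetic yields the 279 tok/s average headline number.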

What MetalRT Is Built For

| Use Case | Why MetalRT |
| --- | --- |
| Visual assistants | 92ms to process an image and start responding |
| Document understanding | Fast vision decode for OCR and document QA pipelines |
| Multimodal agents | Compound latency savings across vision + language calls |
| Accessibility tools | Describe images instantly, entirely on-device |
| Privacy-first apps | Process sensitive images without sending them to a cloud |
| Edge deployment | Real-time visual inference in offline environments |

Summary

  • 279 tok/s average vision decode
  • 92ms time-to-output (224×224)
  • 1.22x faster than mlx-vlm on vision decode
  • 1.24x faster than llama.cpp on vision decode
  • 100% decode win rate (36 out of 36 configs)

Output quality is identical across all engines. The model is the same. The speed is not.

VLM support via MetalRT will be available through RCLI. Stay tuned.


Benchmarked on Apple M4 Max, 64GB RAM, macOS 26.3. Model: Qwen3-VL-2B-Instruct (4-bit). Greedy decoding, 5 runs, averaged. Images: COCO val2017, resized to target resolution. KV cache cleared between every run on all engines.

Copyright © 2025 RunAnywhere, Inc.