MetalRT Now Does Speech-to-Speech. 1.52x Faster Than mlx-audio.
We've shipped the fastest LLM decode, the fastest STT and TTS, and the fastest VLM engine for Apple Silicon.
Today, MetalRT adds speech-to-speech. Audio in, audio out. 1.68 seconds end-to-end.
We benchmarked LFM2.5-Audio against mlx-audio across three audio lengths. MetalRT won every single one.
123 tok/s peak generation throughput. 1.52x faster end-to-end than mlx-audio.
We tested against:
- mlx-audio - a community-maintained audio framework built on Apple's MLX (`pip install mlx-audio`)
Setup
| Engine | Compute | Benchmark Method |
|---|---|---|
| MetalRT | Metal GPU | Native binary |
| mlx-audio | Metal GPU | Python pipeline |
- Hardware: Apple M4 Max, 64GB unified memory, macOS 26.3
- Model: LFM2.5-Audio-1.5B (8-bit)
- Audio: Real speech inputs at 3.8s, 11.2s, and 34.7s
- Fairness: Identical audio inputs, same model across both engines
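The methodology above (identical inputs, full wall-clock timing, same model) can be sketched as a small harness. This is an illustrative sketch, not MetalRT's actual benchmark code; `run_engine` is a hypothetical callable wrapping either engine's full pipeline for one input file.

```python
import statistics
import time

def bench_latency_ms(run_engine, audio_path, warmup=1, runs=5):
    """Median end-to-end wall-clock latency (ms) for one engine on one input.

    run_engine: hypothetical callable that executes the full S2S pipeline
    (encode, generate, detokenize) for a single audio file.
    """
    for _ in range(warmup):                  # warm caches, compile kernels
        run_engine(audio_path)
    samples = []
    for _ in range(runs):
        t0 = time.perf_counter()
        run_engine(audio_path)
        samples.append((time.perf_counter() - t0) * 1000.0)
    return statistics.median(samples)        # median is robust to outlier runs
```

Running the same harness with the same audio files against both engines is what makes the per-length comparisons below apples-to-apples.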
End-to-End Latency
Full pipeline: audio input, encode, generate text and audio tokens, detokenize to output audio. MetalRT won every length.

Lower is better
| Audio | MetalRT | mlx-audio | Winner |
|---|---|---|---|
| Short (3.8s) | 1,785ms | 2,466ms | MetalRT |
| Medium (11.2s) | 1,684ms | 2,558ms | MetalRT |
| Long (34.7s) | 1,771ms | 2,603ms | MetalRT |
1.68 seconds to hear a spoken response to spoken input. Entirely on-device.
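The stages behind these end-to-end numbers can be sketched as follows. A minimal illustration with per-stage timing, assuming hypothetical `encode`, `generate`, and `detokenize` callables; this is not MetalRT's API.

```python
import time

def run_s2s(audio_in, encode, generate, detokenize):
    """Full S2S pipeline: returns output audio plus per-stage timings in ms."""
    timings = {}

    t0 = time.perf_counter()
    features = encode(audio_in)                      # input audio -> encoder features
    timings["encode"] = (time.perf_counter() - t0) * 1000.0

    t0 = time.perf_counter()
    text_tokens, audio_tokens = generate(features)   # text + audio tokens
    timings["generate"] = (time.perf_counter() - t0) * 1000.0

    t0 = time.perf_counter()
    audio_out = detokenize(audio_tokens)             # audio tokens -> waveform
    timings["detokenize"] = (time.perf_counter() - t0) * 1000.0

    return audio_out, timings
```

Reported end-to-end latency is the sum of all three stages, which is why generation throughput alone doesn't tell the whole story.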
Generation Throughput
Generation is the core of the pipeline — producing both text and audio tokens from the encoded input.

Higher is better
| Audio | MetalRT | mlx-audio | Winner |
|---|---|---|---|
| Short (3.8s) | 116 tok/s | 82 tok/s | MetalRT |
| Medium (11.2s) | 123 tok/s | 79 tok/s | MetalRT |
| Long (34.7s) | 118 tok/s | 78 tok/s | MetalRT |
The speedups:
- 1.41x on short audio
- 1.56x on medium audio
- 1.51x on long audio
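These ratios follow directly from the throughput table, rounded from the tok/s values shown (the last digit can shift slightly if unrounded measurements are used instead):

```python
# Generation-throughput speedup: MetalRT tok/s divided by mlx-audio tok/s,
# using the rounded values from the table above.
metalrt = {"short": 116, "medium": 123, "long": 118}
mlx_audio = {"short": 82, "medium": 79, "long": 78}

speedup = {k: round(metalrt[k] / mlx_audio[k], 2) for k in metalrt}
print(speedup)  # {'short': 1.41, 'medium': 1.56, 'long': 1.51}
```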
MetalRT vs mlx-audio

MetalRT is 1.38-1.52x faster than mlx-audio on total end-to-end latency across all audio lengths.
What MetalRT Is Built For
| Use Case | Why MetalRT |
|---|---|
| Voice assistants | Sub-2s spoken response to spoken input, entirely on-device |
| Real-time interpretation | Fast enough for conversational back-and-forth |
| Accessibility tools | Instant voice interaction without cloud dependencies |
| Privacy-first voice apps | Audio never leaves the device |
| Edge deployment | Voice AI in offline and air-gapped environments |
Summary
- 1,684ms best end-to-end latency (Medium)
- 123 tok/s peak generation throughput
- 1.52x faster than mlx-audio (Medium)
- 100% win rate across all audio lengths
Output quality is comparable across both engines. The model is the same. The speed is not.
S2S support via MetalRT will be available through RCLI. Stay tuned.
Benchmarked on Apple M4 Max, 64GB RAM, macOS 26.3. Model: LFM2.5-Audio-1.5B (8-bit). Full pipeline: audio encode, text+audio token generation, detokenize to output WAV. Identical audio inputs across both engines.