March 15, 2026


MetalRT Now Does Speech-to-Speech. 1.52x Faster Than mlx-audio.


We've shipped the fastest LLM decode, the fastest STT and TTS, and the fastest VLM engine for Apple Silicon.

Today, MetalRT adds speech-to-speech. Audio in, audio out. As fast as 1.68 seconds end-to-end.

We benchmarked MetalRT against mlx-audio, both running LFM2.5-Audio, across three audio lengths. MetalRT won every single one.

Up to 123 tok/s generation throughput. Up to 1.52x faster end-to-end than mlx-audio.

We tested against:

  • mlx-audio - an audio library built on Apple's MLX framework (pip install mlx-audio)

Setup

| Engine | Compute | Benchmark Method |
| --- | --- | --- |
| MetalRT | Metal GPU | Native binary |
| mlx-audio | Metal GPU | Python pipeline |

  • Hardware: Apple M4 Max, 64GB unified memory, macOS 26.3
  • Model: LFM2.5-Audio-1.5B (8-bit)
  • Audio: Real speech inputs at 3.8s, 11.2s, and 34.7s
  • Fairness: Identical audio inputs, same model across both engines
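
A comparison like this needs every engine timed on exactly the same files, under the same procedure. The sketch below shows one way such a harness could look; it is not the harness used for these benchmarks. The `run` callables stand in for whatever invokes each engine (hypothetical; neither MetalRT's nor mlx-audio's real API is shown), and the warm-up and repeat counts are assumptions.

```python
import statistics
import time
from typing import Callable, Dict, Iterable


def median_latency_ms(run: Callable[[str], object], audio_path: str,
                      warmup: int = 1, repeats: int = 5) -> float:
    """Median wall-clock latency of run(audio_path), in milliseconds."""
    for _ in range(warmup):
        run(audio_path)  # warm-up runs are executed but not timed
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        run(audio_path)
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


def compare_engines(engines: Dict[str, Callable[[str], object]],
                    audio_paths: Iterable[str]) -> Dict[str, Dict[str, float]]:
    """Time every engine on the same list of audio files."""
    paths = list(audio_paths)  # one shared list, so inputs are identical
    return {
        name: {path: median_latency_ms(run, path) for path in paths}
        for name, run in engines.items()
    }
```

Reporting the median rather than the mean keeps a single slow run (for example, a cold cache) from skewing the result.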

End-to-End Latency

Full pipeline: audio input, encode, generate text and audio tokens, detokenize to output audio. MetalRT won every length.
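
To make the stages concrete, here is a minimal sketch of timing each step of such a pipeline separately. The `encode`, `generate`, and `detokenize` callables are hypothetical placeholders, not MetalRT or mlx-audio functions, and the assumption that the three stages sum to the total is a simplification.

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator


@contextmanager
def stage(timings: Dict[str, float], name: str) -> Iterator[None]:
    """Record the wall-clock duration of one pipeline stage, in ms."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = (time.perf_counter() - start) * 1000.0


def timed_pipeline(encode, generate, detokenize, audio):
    """Run the three stages in order and return output, token count, timings."""
    timings: Dict[str, float] = {}
    with stage(timings, "encode"):
        features = encode(audio)       # hypothetical: audio -> encoder features
    with stage(timings, "generate"):
        tokens = generate(features)    # hypothetical: text + audio tokens
    with stage(timings, "detokenize"):
        wav = detokenize(tokens)       # hypothetical: tokens -> output audio
    timings["total"] = (timings["encode"] + timings["generate"]
                        + timings["detokenize"])
    return wav, len(tokens), timings


def tokens_per_second(num_tokens: int, generate_ms: float) -> float:
    """Generation throughput: tokens produced divided by generation time."""
    return num_tokens / (generate_ms / 1000.0)
```

Splitting the timings this way separates end-to-end latency from generation throughput, the two metrics this post reports.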

Total Latency (lower is better)

| Audio | MetalRT | mlx-audio | Winner |
| --- | --- | --- | --- |
| Short (3.8s) | 1,785ms | 2,466ms | MetalRT |
| Medium (11.2s) | 1,684ms | 2,558ms | MetalRT |
| Long (34.7s) | 1,771ms | 2,603ms | MetalRT |

As little as 1.68 seconds to hear a spoken response to spoken input, and under 1.8 seconds at every length tested. Entirely on-device.
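
For readers who want to check the arithmetic, the end-to-end speedup at each length is the mlx-audio latency divided by the MetalRT latency. The snippet uses only the latency figures published above; the variable and function names are illustrative.

```python
# Total latency in ms, copied from the table: (MetalRT, mlx-audio)
latency_ms = {
    "Short (3.8s)": (1785, 2466),
    "Medium (11.2s)": (1684, 2558),
    "Long (34.7s)": (1771, 2603),
}


def latency_speedup(metalrt_ms: float, mlx_audio_ms: float) -> float:
    """For latency, lower is better, so speedup = baseline / candidate."""
    return mlx_audio_ms / metalrt_ms


latency_speedups = {
    label: round(latency_speedup(metalrt, mlx), 2)
    for label, (metalrt, mlx) in latency_ms.items()
}

for label, ratio in latency_speedups.items():
    print(f"{label}: {ratio:.2f}x")
# Short (3.8s): 1.38x
# Medium (11.2s): 1.52x
# Long (34.7s): 1.47x
```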

Generation Throughput

Generation is the core of the pipeline — producing both text and audio tokens from the encoded input.

Throughput (higher is better)

| Audio | MetalRT | mlx-audio | Winner |
| --- | --- | --- | --- |
| Short (3.8s) | 116 tok/s | 82 tok/s | MetalRT |
| Medium (11.2s) | 123 tok/s | 79 tok/s | MetalRT |
| Long (34.7s) | 118 tok/s | 78 tok/s | MetalRT |

The throughput speedups over mlx-audio:

  • 1.42x on short audio
  • 1.56x on medium audio
  • 1.50x on long audio
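
These ratios are MetalRT throughput divided by mlx-audio throughput. Recomputing them from the rounded tok/s values in the table gives figures within 0.01 of the list above. The small differences on short and long audio presumably come from the list using unrounded measurements, which is an assumption on our part. Names in the snippet are illustrative.

```python
# Generation throughput in tok/s, copied from the table: (MetalRT, mlx-audio)
throughput_tok_s = {
    "Short (3.8s)": (116, 82),
    "Medium (11.2s)": (123, 79),
    "Long (34.7s)": (118, 78),
}


def throughput_speedup(metalrt: float, mlx_audio: float) -> float:
    """For throughput, higher is better, so speedup = candidate / baseline."""
    return metalrt / mlx_audio


throughput_speedups = {
    label: round(throughput_speedup(metalrt, mlx), 2)
    for label, (metalrt, mlx) in throughput_tok_s.items()
}

for label, ratio in throughput_speedups.items():
    print(f"{label}: {ratio:.2f}x")
# Short (3.8s): 1.41x
# Medium (11.2s): 1.56x
# Long (34.7s): 1.51x
```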

MetalRT vs mlx-audio

MetalRT is 1.38-1.52x faster than mlx-audio on total end-to-end latency across all audio lengths: 1.38x on short, 1.52x on medium, and 1.47x on long audio.

What MetalRT Is Built For

| Use Case | Why MetalRT |
| --- | --- |
| Voice assistants | Sub-2s spoken response to spoken input, entirely on-device |
| Real-time interpretation | Fast enough for conversational back-and-forth |
| Accessibility tools | Instant voice interaction without cloud dependencies |
| Privacy-first voice apps | Audio never leaves the device |
| Edge deployment | Voice AI in offline and air-gapped environments |

Summary

  • 1,684ms best end-to-end latency (Medium)
  • 123 tok/s peak generation throughput
  • 1.52x faster end-to-end than mlx-audio (Medium)
  • 100% win rate across all audio lengths

Output quality is comparable across both engines. The model is the same. The speed is not.

S2S support via MetalRT will be available through RCLI. Stay tuned.


Benchmarked on Apple M4 Max, 64GB RAM, macOS 26.3. Model: LFM2.5-Audio-1.5B (8-bit). Full pipeline: audio encode, text+audio token generation, detokenize to output WAV. Identical audio inputs across both engines.


Copyright © 2025 RunAnywhere, Inc.