February 22, 2026


FastVoice: 63ms First-Audio Latency for On-Device Voice AI on Apple Silicon


How fast can a voice AI respond if it never leaves your device?

63 milliseconds. From the moment you stop speaking to the moment you hear the first word back. That's faster than a human blink.

We built FastVoice — a fully on-device voice AI pipeline in C++ that composes speech recognition, LLM inference, and text-to-speech on Apple Silicon. No cloud APIs. No network round-trips. No privacy compromises.

This post walks through the system architecture, the four optimizations that got us there, and the ablation study that proves each one matters.

Why First-Audio Latency?

Most LLM benchmarks measure throughput — tokens per second. That metric is wrong for voice.

In a voice conversation, what the user perceives is how long they wait before hearing a response. Research shows the human perceptual threshold for conversational responsiveness is 200–500ms. Anything under 200ms feels instant.

We call this metric first-audio latency: the time from end of user speech to first audible word of the response.

[Figure: what makes up first-audio latency]

The formal decomposition:

    L_first = T_vad + T_asr + T_ttft + T_gen(k) + T_tts(c₁)

The key lever is k — the number of tokens generated before TTS has enough text to speak. At 205 tok/s, each extra token adds ~4.9ms. Minimizing k is the game.
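The decomposition can be sketched as a small helper. This is my own illustrative function, not FastVoice code; the parameter names mirror the terms above:

```cpp
// Sketch of L_first = T_vad + T_asr + T_ttft + T_gen(k) + T_tts(c1).
// All inputs are in milliseconds except tok_per_s.
double first_audio_ms(double t_vad, double t_asr, double t_ttft,
                      int k_tokens, double tok_per_s, double t_tts_first) {
    double t_gen = 1000.0 * k_tokens / tok_per_s;  // T_gen(k): k tokens at the decode rate
    return t_vad + t_asr + t_ttft + t_gen + t_tts_first;
}
```

At 205 tok/s, T_gen(7) is about 34ms, so cutting k from a full sentence (15–30 tokens) to 7 words removes tens of milliseconds on its own.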

The Pipeline

FastVoice is a multi-stage streaming pipeline. Each stage runs on a dedicated OS thread, connected by lock-free ring buffers.
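A minimal single-producer/single-consumer ring in the spirit of those inter-stage queues might look like the following. This is a sketch under my own assumptions, not the actual FastVoice implementation:

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <optional>

// Lock-free SPSC ring buffer: one producer thread, one consumer thread.
// Capacity must be a power of two; one slot is sacrificed to distinguish
// full from empty.
template <typename T, size_t N>
class SpscRing {
    static_assert((N & (N - 1)) == 0, "N must be a power of two");
    std::array<T, N> buf_;
    // 64-byte alignment keeps the two indices on separate cache lines,
    // avoiding false sharing between producer and consumer.
    alignas(64) std::atomic<size_t> head_{0};  // consumer index
    alignas(64) std::atomic<size_t> tail_{0};  // producer index
public:
    bool push(const T& v) {
        size_t t = tail_.load(std::memory_order_relaxed);
        if (((t + 1) & (N - 1)) == head_.load(std::memory_order_acquire))
            return false;  // full
        buf_[t] = v;
        tail_.store((t + 1) & (N - 1), std::memory_order_release);
        return true;
    }
    std::optional<T> pop() {
        size_t h = head_.load(std::memory_order_relaxed);
        if (h == tail_.load(std::memory_order_acquire))
            return std::nullopt;  // empty
        T v = buf_[h];
        head_.store((h + 1) & (N - 1), std::memory_order_release);
        return v;
    }
};
```

The acquire/release pairing is what makes this safe without locks: the consumer only observes a slot after the producer's write to it is visible.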

[Figure: STT → LLM → TTS streaming pipeline architecture]
| Stage | Component | Details |
| --- | --- | --- |
| VAD | Silero VAD | Voice activity detection, speech segmentation |
| STT | Whisper / Moonshine | Via sherpa-onnx, CoreAudio integration |
| LLM | Qwen3 0.6B (Q4_K_M) | llama.cpp, Metal GPU, KV cache Q8_0 |
| TTS | Piper VITS (Amy medium) | 1.1x speed scaling, word-level flushing |
| Audio | CoreAudio | Zero-allocation real-time output |

The entire system uses a pre-allocated 64MB memory pool backed by OS-level 2MB superpages with bump allocation and 64-byte cache-line alignment. No malloc in the hot path. No GC pauses. No jitter.
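A bump allocator over a pre-allocated pool can be sketched as below. This is illustrative only: the superpage backing is omitted (a real version would request 2MB pages from the OS at pool creation), and the class name is mine:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Bump allocator over one pre-allocated pool: allocation is pointer
// arithmetic, there is no per-object free, and every returned pointer
// is 64-byte (cache-line) aligned.
class BumpPool {
    std::vector<uint8_t> pool_;
    size_t off_ = 0;  // bytes consumed so far
public:
    explicit BumpPool(size_t bytes) : pool_(bytes) {}
    void* alloc(size_t n) {
        uintptr_t base = reinterpret_cast<uintptr_t>(pool_.data());
        uintptr_t p = (base + off_ + 63) & ~uintptr_t{63};  // round up to 64B
        if (p + n > base + pool_.size()) return nullptr;    // pool exhausted
        off_ = p + n - base;
        return reinterpret_cast<void*>(p);
    }
    void reset() { off_ = 0; }  // releases everything at once
};
```

The appeal for real-time audio is determinism: `alloc` is a handful of instructions with no locks, no syscalls, and no free-list traversal, so it cannot introduce jitter in the hot path.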

Baseline: Where Does the Time Go?

Before any optimizations, here's where first-audio latency comes from:

[Figure: baseline latency decomposition by stage]
| Stage | Latency | Share |
| --- | --- | --- |
| VAD | 4.8ms | 6% |
| STT (ASR) | 37.4ms | 46% |
| LLM TTFT | 13.1ms | 16% |
| LLM gen → first sentence | ~25ms | 31% |
| TTS (first chunk) | ~151ms | — |
| First-audio | 80.6ms | — |

STT dominates at 46%. But the real opportunity is in how we bridge LLM generation and TTS — the streaming granularity.

The Four Optimizations

We applied four optimizations, each ablated independently.

1. KV Cache Quantization (FP16 → Q8_0)

Standard KV caches store attention states at FP16 (2 bytes per element). We quantize to Q8_0 (1.0625 bytes per element).
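The 1.0625 figure follows from the Q8_0 block layout, and it predicts the memory savings exactly. A small constexpr check (my own arithmetic, matching the numbers in the table below):

```cpp
// Q8_0 packs 32 int8 values plus one fp16 scale per block:
// (32 + 2) bytes / 32 elements = 1.0625 bytes per element, vs 2.0 for FP16.
constexpr double q8_0_bytes_per_elem = 34.0 / 32.0;  // 1.0625
constexpr double fp16_bytes_per_elem = 2.0;

// Expected KV cache size after quantizing an FP16 cache of `fp16_mb` MB.
constexpr double q8_0_kv_mb(double fp16_mb) {
    return fp16_mb * q8_0_bytes_per_elem / fp16_bytes_per_elem;
}

static_assert(q8_0_kv_mb(224.0) == 119.0, "224 MB FP16 -> 119 MB Q8_0");
```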

[Figure: KV cache memory, FP16 vs Q8_0]
| Metric | FP16 | Q8_0 | Change |
| --- | --- | --- | --- |
| KV cache memory | 224 MB | 119 MB | -47% |
| Long-tier first-audio | 82.9ms | 69.7ms | -16% |
| Decode throughput | 205 tok/s | 194 tok/s | -5.6% |

The 5.6% throughput regression is worth it. On Apple Silicon's unified memory, reducing KV cache size directly reduces GPU memory pressure, which improves latency across the entire pipeline. This cross-component interaction is invisible if you benchmark the LLM in isolation.

2. Word-Level Streaming Flush

This is the big one.

Standard TTS pipelines wait for a complete sentence before synthesizing speech. That means the LLM must generate enough tokens to form an entire sentence — potentially 15–30 tokens — before the user hears anything.

Word-level flushing changes the rule: emit text to TTS after N=7 words, even without a sentence boundary.
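The flush rule can be sketched as a small state machine over decoded token text. This simplifies the real logic (word counting by spaces, a fixed punctuation set); the struct and its names are mine:

```cpp
#include <string>

// Accumulates streamed LLM text and flushes a chunk to TTS once it holds
// n_words complete words OR hits a sentence boundary, whichever comes first.
struct WordFlusher {
    int n_words = 7;       // the N=7 threshold from the sweep
    std::string pending;   // text not yet sent to TTS
    int words = 0;         // completed words in `pending`

    // Feed one decoded token's text; returns a chunk to synthesize, or "".
    std::string feed(const std::string& tok) {
        for (char c : tok) {
            pending += c;
            if (c == ' ') ++words;
            bool boundary = (c == '.' || c == '!' || c == '?');
            if (words >= n_words || boundary) {
                std::string out;
                out.swap(pending);  // hand the chunk off, reset the buffer
                words = 0;
                return out;
            }
        }
        return "";  // keep accumulating
    }
};
```

With sentence-only flushing the same loop would wait for `boundary` alone, which is exactly why long responses paid a latency penalty proportional to sentence length.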

[Figure: first-audio latency, sentence flush vs word flush (N=7)]
| Response Tier | Sentence Flush | Word Flush (N=7) | Change |
| --- | --- | --- | --- |
| Short | 54.2ms | 55.0ms | +1.5% |
| Medium | 63.5ms | 62.8ms | -1.1% |
| Long | 82.9ms | 63.1ms | -24% |
| Longer | 76.2ms | 66.4ms | -13% |
| Longest | 77.1ms | 67.7ms | -12% |

The critical insight: word-level flushing makes first-audio latency nearly constant (~55–77ms) regardless of how long the LLM's response is. Without it, longer responses mean longer waits.

Word Count Threshold Sweep

We swept N from 1 to ∞ to find the optimal threshold:

[Figure: first-audio latency across word flush thresholds]

| Threshold (N) | First-Audio (Long) | Notes |
| --- | --- | --- |
| 1 | ~50ms | Fragmented, robotic prosody |
| 3 | ~55ms | Better, still choppy |
| 7 | 63.1ms | Sweet spot for VITS models |
| 10 | ~70ms | Marginal benefit over sentence flush |
| ∞ (sentence) | 82.9ms | Standard approach |

N=7 hits the quality-latency Pareto frontier: low enough to start TTS early, high enough that Piper produces natural prosody.

3. System Prompt KV Cache

The system prompt is identical across all user queries. We cache its KV state at initialization and restore it per query instead of re-evaluating.

| Metric | Without Cache | With Cache | Change |
| --- | --- | --- | --- |
| TTFT | 15.6ms | 13.1ms | -16% |

Small but free. Every millisecond matters in voice.
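The pattern is a prefix-state cache: evaluate the fixed prompt once, snapshot the opaque KV state, and restore the snapshot per query. The sketch below is conceptual; `eval_and_snapshot` stands in for whatever state save/restore calls the inference library provides (llama.cpp exposes such calls, but no specific API is assumed here), and the class name is mine:

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Caches the serialized KV state for fixed prompt prefixes, keyed by the
// prompt text. First lookup pays the full prefill; later lookups return
// the stored bytes, which the caller restores into the model context.
struct PrefixCache {
    std::unordered_map<std::string, std::vector<uint8_t>> snapshots;

    template <typename EvalFn>
    const std::vector<uint8_t>& get(const std::string& prompt,
                                    EvalFn eval_and_snapshot) {
        auto it = snapshots.find(prompt);
        if (it == snapshots.end())  // cold path: run prefill once, snapshot it
            it = snapshots.emplace(prompt, eval_and_snapshot(prompt)).first;
        return it->second;          // warm path: just restore these bytes
    }
};
```

This only pays off because the system prompt is byte-identical across queries; any change to it invalidates the cached state.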

4. Asynchronous LLM–TTS Pipeline

In the baseline, TTS synthesis blocks the LLM decode thread. While Piper synthesizes one sentence (~147ms), the LLM sits idle.

We refactored to a producer-consumer queue: the LLM pushes text chunks to a queue, and a dedicated TTS worker thread consumes them independently.
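The handoff can be sketched with a plain mutex/condvar queue (the post's pipeline uses lock-free rings; this simpler form is mine, chosen for clarity). The key property is that `push` never blocks the producer on synthesis:

```cpp
#include <condition_variable>
#include <mutex>
#include <optional>
#include <queue>
#include <string>

// Producer-consumer text queue: the LLM decode thread calls push() and
// returns immediately; a dedicated TTS worker thread blocks in pop().
class TextQueue {
    std::queue<std::string> q_;
    std::mutex m_;
    std::condition_variable cv_;
    bool closed_ = false;
public:
    void push(std::string s) {
        { std::lock_guard<std::mutex> lk(m_); q_.push(std::move(s)); }
        cv_.notify_one();
    }
    void close() {  // called when the LLM finishes the response
        { std::lock_guard<std::mutex> lk(m_); closed_ = true; }
        cv_.notify_all();
    }
    std::optional<std::string> pop() {  // TTS worker waits here, not the LLM
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [&] { return !q_.empty() || closed_; });
        if (q_.empty()) return std::nullopt;  // closed and drained
        std::string s = std::move(q_.front());
        q_.pop();
        return s;
    }
};
```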

[Figure: synchronous vs asynchronous LLM–TTS pipeline timeline]

For a 6-sentence response, this recovers ~882ms of blocked LLM generation time. The LLM never waits for TTS again.

Cumulative Results

All four optimizations combined:

[Figure: first-audio latency, baseline vs fully optimized]
| Metric | Baseline | Optimized | Change |
| --- | --- | --- | --- |
| First-audio (Long) | 80.6ms | 63.1ms | -22% |
| KV cache memory | 224 MB | 119 MB | -47% |
| First-audio range | 54–83ms | 55–77ms | Near-constant |
| Tool-call accuracy | 100% | 100% | Preserved |

63ms first-audio. Well under the 200ms perceptual threshold. The user hears a response before they consciously register that they've stopped speaking.

Model Scale: What's Fast Enough?

We tested across three Qwen3 model sizes:

| Model | First-Audio | Decode Speed | Meets <200ms? |
| --- | --- | --- | --- |
| 0.6B | 63.1ms | 194 tok/s | Yes |
| 1.7B | 148ms | 89 tok/s | Yes (barely) |
| 4B | 326ms | 42 tok/s | No |

On current Apple Silicon, only sub-1B models achieve truly instant voice interaction. As hardware improves, larger models will cross the threshold.

What This Means

FastVoice proves that cloud-competitive voice AI is possible entirely on-device:

  • Privacy: Audio and text never leave the device
  • Offline: Works without any network connection
  • Cost: Zero API calls, zero per-query charges
  • Latency: 63ms — faster than any cloud round-trip

The pipeline handles tool calling with 100% correctness, enabling on-device voice agents that can check weather, control smart home devices, or query databases — all without touching a server.

Summary

  • 63ms first-audio latency (22% reduction from baseline)
  • 47% KV cache memory reduction (224MB → 119MB)
  • Near-constant first-audio (~55–77ms) regardless of response length
  • 100% tool-call correctness across all configurations
  • Zero cloud dependencies

The bottleneck in voice AI isn't the model. It's the pipeline. FastVoice shows what happens when you optimize the full stack — from VAD to audio output — as a single, streaming C++ system on Apple Silicon.


Evaluated on Apple Silicon M3 Max, 36GB unified memory. Model: Qwen3 0.6B Q4_K_M with Q8_0 KV cache. STT: Whisper tiny.en via sherpa-onnx. TTS: Piper Amy medium at 1.1x speed. File-mode benchmarks, 5 runs, median reported. Word flush threshold N=7.


Copyright © 2025 RunAnywhere, Inc.