February 24, 2026


FastVoice RAG: Sub-200ms Voice AI with Retrieval-Augmented Generation, Entirely On-Device


Two days ago, we shipped FastVoice — a 63ms on-device voice AI pipeline. Today, we're giving it a knowledge base.

FastVoice RAG adds hybrid retrieval-augmented generation to our latency-optimized STT → LLM → TTS stack. The entire pipeline — including retrieval over 5,016 document chunks — runs on Apple Silicon with zero cloud dependencies.

The headline number: sub-200ms first-audio with full RAG retrieval. Here's how.

The Problem with RAG in Voice

RAG grounds LLM responses in external knowledge. It's standard practice in cloud-based chatbots. But voice AI has a constraint that chatbots don't: latency kills the conversation.

Research on conversational turn-taking shows that humans notice response gaps beyond roughly 200–300ms. Cloud RAG pipelines (LlamaIndex, RAGFlow, PrivateGPT) routinely exceed 1–2 seconds for retrieval alone, before the LLM even starts generating.

We needed to answer: can you add RAG to a real-time voice pipeline without breaking the latency budget?

The Latency Decomposition

We instrumented every stage of the pipeline at microsecond resolution. Here's what we found:

[Figure: RAG latency breakdown. Where the time actually goes in a voice RAG pipeline.]

Full Pipeline Breakdown (top-k=5)

| Stage | Latency | Notes |
| --- | --- | --- |
| Query preprocessing | 0.003ms | Tokenization |
| Embedding (cached) | 0.015μs | 99.9% cache hit rate |
| Embedding (uncached) | 5.68ms | Snowflake Arctic Embed S, Q8_0 |
| HNSW vector search | 0.107ms | 384-dim, 5,016 chunks |
| BM25 lexical search | 0.018ms | Pre-computed IDF |
| RRF fusion | 0.001ms | Zero-allocation score buffers |
| Retrieval total | <4ms | Including embedding |
| LLM TTFT (no RAG) | 22.5ms | Baseline |
| LLM TTFT (with RAG) | 57.7ms | +157% from retrieved context |
| TTS first sentence | ~92ms | Piper, word-level flush |

The retrieval paradox: retrieval itself is negligible (<0.15ms without embedding). The real cost is LLM prefill — processing the retrieved chunks adds 157% to time-to-first-token.

But we have a solution for that.

The Architecture

FastVoice RAG extends our voice pipeline with a hybrid retrieval engine between STT and LLM.

[Figure: FastVoice RAG pipeline. Mic → STT → RAG Retrieval → LLM → TTS → Speaker.]

Hybrid Retrieval Engine

We combine two retrieval methods, dense vector search and lexical search, and fuse their rankings via Reciprocal Rank Fusion (RRF). Each chunk's fused score is the sum of 1/(k + rank) over the rankings in which it appears, with k = 60:

| Method | What It Does | Latency |
| --- | --- | --- |
| HNSW | Semantic vector search (384-dim) | 0.107ms |
| BM25 | Lexical keyword matching | 0.018ms |
| RRF fusion | Combines rankings, k=60 | 0.001ms |

Both indexes are memory-mapped (mmap with MAP_PRIVATE, MADV_SEQUENTIAL) for near-instant startup and zero-copy access.
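For the curious, the mapping setup looks roughly like this in POSIX terms. A minimal sketch only: the function name, return convention, and error handling are illustrative, not FastVoice's actual code.

```cpp
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstddef>

// Map an index file read-only for zero-copy access (illustrative sketch).
const void* map_index(const char* path, size_t* out_len) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return nullptr;

    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return nullptr; }
    *out_len = size_t(st.st_size);

    // MAP_PRIVATE gives a copy-on-write mapping; we only ever read it.
    void* base = mmap(nullptr, *out_len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  // the mapping remains valid after closing the fd
    if (base == MAP_FAILED) return nullptr;

    // Hint that the initial index walk is sequential, so the kernel
    // can read ahead aggressively instead of faulting page by page.
    madvise(base, *out_len, MADV_SEQUENTIAL);
    return base;
}
```

Because pages are faulted in on demand, "startup" is just the mmap call; no bytes are copied until the index is actually touched.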

The fusion uses pre-allocated score buffers — one float per chunk in the corpus — with a touched_ vector that tracks which entries were modified. Between queries, only the touched entries are reset: O(candidates) instead of O(n_chunks).
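A minimal sketch of that fusion pattern follows. Class and method names are illustrative, not FastVoice's actual API; only the score-buffer-plus-touched-list idea is from the description above.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Zero-allocation RRF fusion over pre-allocated score buffers (sketch).
class RrfFuser {
public:
    explicit RrfFuser(size_t n_chunks, float k = 60.0f)
        : k_(k), scores_(n_chunks, 0.0f) {   // one float per corpus chunk
        touched_.reserve(256);
    }

    // Accumulate one retriever's ranked candidate list:
    // score[id] += 1 / (k + rank), with rank starting at 1.
    void accumulate(const std::vector<int>& ranked_ids) {
        for (size_t r = 0; r < ranked_ids.size(); ++r) {
            int id = ranked_ids[r];
            if (scores_[id] == 0.0f) touched_.push_back(id);  // first touch
            scores_[id] += 1.0f / (k_ + float(r + 1));
        }
    }

    // Return the top-k fused candidates, then reset only the entries
    // modified this query: O(candidates), not O(n_chunks).
    std::vector<int> top_k(size_t k) {
        size_t m = std::min(k, touched_.size());
        std::partial_sort(touched_.begin(), touched_.begin() + m, touched_.end(),
                          [&](int a, int b) { return scores_[a] > scores_[b]; });
        std::vector<int> out(touched_.begin(), touched_.begin() + m);
        for (int id : touched_) scores_[id] = 0.0f;  // reset touched only
        touched_.clear();
        return out;
    }

private:
    float k_;
    std::vector<float> scores_;  // pre-allocated, never resized per query
    std::vector<int> touched_;   // entries modified since the last reset
};
```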

On-Device Embedding

We use Snowflake Arctic Embed S (33M parameters, 384 dimensions, Q8_0 quantized) running via llama.cpp with Metal GPU offloading.

The embedding model and the LLM share Apple Silicon's unified memory through dual Metal GPU contexts — the embedding model (~33MB) coexists with the LLM without context switching overhead.
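In llama.cpp terms, holding both models at once looks roughly like this. A sketch under assumptions: the model paths are placeholders, and the C API names shown here (which do exist in llama.cpp) have shifted across versions, so treat this as illustrative rather than FastVoice's exact setup.

```cpp
#include "llama.h"

int main() {
    llama_backend_init();

    llama_model_params mp = llama_model_default_params();
    mp.n_gpu_layers = 99;  // offload all layers to Metal

    // The ~33MB embedder and the chat LLM coexist in unified memory.
    llama_model* embed_model =
        llama_load_model_from_file("arctic-embed-s-q8_0.gguf", mp);
    llama_model* chat_model =
        llama_load_model_from_file("qwen3-0.6b-q4_k_m.gguf", mp);

    llama_context_params ep = llama_context_default_params();
    ep.embeddings = true;  // this context returns embeddings, not logits
    llama_context* embed_ctx = llama_new_context_with_model(embed_model, ep);

    llama_context_params cp = llama_context_default_params();
    llama_context* chat_ctx = llama_new_context_with_model(chat_model, cp);

    // ... embed queries on embed_ctx, generate on chat_ctx ...

    llama_free(embed_ctx);
    llama_free(chat_ctx);
    llama_free_model(embed_model);
    llama_free_model(chat_model);
    llama_backend_free();
    return 0;
}
```

Two live contexts means a retrieval embedding never has to wait for the LLM's weights to be swapped in, which is what "no context switching overhead" buys.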

The Embedding Cache

Embedding is the most expensive retrieval operation at 5.68ms per query. For voice workloads, users often repeat or rephrase similar queries. We built a frequency-weighted LRU cache:

[Figure: Embedding cache hit rate and latency over time.]

Eviction policy:

```text
score(e) = √frequency(e) / (1 + age_seconds(e))
```

The square root dampens frequency to prevent popular entries from being permanently pinned. The age term ensures stale entries eventually get evicted.

| Metric | Value |
| --- | --- |
| Hit rate | 99.9% |
| Hit latency | 0.015μs |
| Miss latency | 3.04ms |
| Speedup on hits | 255,000x |
| Storage | Pre-allocated contiguous float pool |

The cache uses a pre-allocated contiguous float[] pool of max_entries × dim with O(1) lookup via unordered_map. No allocation jitter. No cache-miss spikes.
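Putting the eviction score and the contiguous pool together, a minimal sketch of such a cache might look like the following. All identifiers are illustrative, not FastVoice's actual code; only the scoring formula and the pool layout come from the description above.

```cpp
#include <cmath>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Frequency-weighted LRU embedding cache over a contiguous float pool (sketch).
class EmbeddingCache {
public:
    EmbeddingCache(size_t max_entries, size_t dim)
        : dim_(dim), pool_(max_entries * dim), entries_(max_entries) {}

    // On hit: bump frequency and return a pointer into the pool (no copy).
    const float* lookup(const std::string& query) {
        auto it = index_.find(query);                 // O(1) via unordered_map
        if (it == index_.end()) return nullptr;
        entries_[it->second].frequency += 1;
        return pool_.data() + it->second * dim_;
    }

    // On miss: store the freshly computed embedding, evicting the
    // lowest-scoring entry once the pool is full.
    void insert(const std::string& query, const float* emb, double now) {
        size_t slot;
        if (index_.size() < entries_.size()) {
            slot = index_.size();                     // pool not yet full
        } else {
            slot = victim(now);                       // lowest score(e) loses
            index_.erase(entries_[slot].key);
        }
        std::copy(emb, emb + dim_, pool_.data() + slot * dim_);
        entries_[slot] = Entry{query, 1, now};
        index_[query] = slot;
    }

private:
    struct Entry {
        std::string key;
        uint64_t frequency = 0;
        double inserted_at = 0.0;  // seconds
    };

    // score(e) = sqrt(frequency) / (1 + age_seconds): the square root
    // dampens frequency so hot entries are not pinned forever, and the
    // age term guarantees stale entries eventually lose.
    double score(const Entry& e, double now) const {
        return std::sqrt(double(e.frequency)) / (1.0 + (now - e.inserted_at));
    }

    size_t victim(double now) const {
        size_t worst = 0;
        double worst_score = score(entries_[0], now);
        for (size_t i = 1; i < entries_.size(); ++i) {
            double s = score(entries_[i], now);
            if (s < worst_score) { worst_score = s; worst = i; }
        }
        return worst;
    }

    size_t dim_;
    std::vector<float> pool_;     // pre-allocated max_entries × dim floats
    std::vector<Entry> entries_;
    std::unordered_map<std::string, size_t> index_;
};
```

Because the pool is allocated once at startup, hits and misses both run allocation-free, which is where the "no allocation jitter" property comes from.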

The Top-k Tradeoff

More retrieved chunks means better grounding but higher LLM prefill cost. We swept top-k from 1 to 10:

[Figure: Top-k sweep. TTFT and first-audio latency vs. number of retrieved chunks.]

| top-k | Retrieval | LLM TTFT | First-Audio | TTFT Growth |
| --- | --- | --- | --- | --- |
| 1 | 5.92ms | 29.1ms | 159.8ms | baseline |
| 2 | 5.91ms | 31.8ms | 166.6ms | +9% |
| 3 | 2.61ms | 36.0ms | 159.6ms | +24% |
| 5 | 5.84ms | 57.7ms | 177.8ms | +98% |
| 7 | 4.05ms | 72.2ms | 175.9ms | +148% |
| 10 | 5.76ms | 110.0ms | 184.8ms | +278% |

Two things jump out:

  1. TTFT scales 3.8x from k=1 to k=10 — the LLM has to prefill all those extra context tokens
  2. First-audio only grows 16% over the same range — word-level flushing absorbs the TTFT increase

This is the key result. Word-level streaming flush, originally designed for bare LLM inference, becomes even more valuable when retrieval context inflates TTFT. The flushing mechanism decouples what the user hears from how much context the LLM processes.
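The mechanism is simple enough to sketch. Assuming token text arrives incrementally from the decoder and a synthesize callback hands text to the TTS engine, a word-level flusher looks roughly like this (both names are illustrative):

```cpp
#include <functional>
#include <string>
#include <utility>

// Flush words to TTS as soon as a whitespace boundary completes them,
// so first audio never waits for a full sentence (sketch).
class WordFlusher {
public:
    explicit WordFlusher(std::function<void(const std::string&)> synthesize)
        : synthesize_(std::move(synthesize)) {}

    // Called for every decoded token's text fragment.
    void on_token(const std::string& token_text) {
        for (char c : token_text) {
            if (c == ' ' || c == '\n') {
                if (!pending_.empty()) {     // a word just completed
                    synthesize_(pending_);
                    pending_.clear();
                }
            } else {
                pending_ += c;
            }
        }
    }

    // Flush the final partial word at end of generation.
    void finish() {
        if (!pending_.empty()) { synthesize_(pending_); pending_.clear(); }
    }

private:
    std::string pending_;
    std::function<void(const std::string&)> synthesize_;
};
```

Once the first word clears the LLM, audio starts; everything after that streams behind it, which is why first-audio barely moves even as prefill balloons.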

Retrieval Mode Comparison

Is hybrid retrieval worth the extra cost over single-mode?

[Figure: Retrieval mode comparison. Hybrid vs. vector-only vs. BM25-only.]

| Mode | Retrieval | TTFT | First-Audio |
| --- | --- | --- | --- |
| Hybrid (RRF) | 4.16ms | 55.0ms | 175.9ms |
| Vector only | 2.90ms | 55.2ms | 166.6ms |
| BM25 only | 2.66ms | 58.7ms | 191.7ms |

Hybrid adds 1.3–1.5ms over single-mode retrieval. The cost is dominated by the embedding computation, not the fusion logic. Vector-only shaves roughly 9ms (about 5%) off first-audio if you're willing to sacrifice lexical matching.

All three modes stay under 200ms first-audio.

How We Compare

We compared FastVoice RAG against every on-device or private RAG system we could find:

| System | Language | On-Device | Voice Integration | Cloud-Free | First-Audio |
| --- | --- | --- | --- | --- | --- |
| FastVoice RAG | C++ | Yes | Yes | Yes | <200ms |
| LlamaIndex | Python | No | No | No | N/A |
| RAGFlow | Python | No | No | No | N/A |
| PrivateGPT | Python | Partial | No | Partial | Multi-second |

FastVoice RAG is the only system that is fully on-device, written in C++, voice-integrated, and cloud-free.

What This Enables

Voice-First Knowledge Assistants

Ask your device questions about your documents — and hear the answer in under 200ms:

  • Medical professionals: Query drug interactions, treatment protocols, patient histories, with patient data staying on-device in support of HIPAA compliance
  • Legal research: Search case law and statutes by voice while reviewing documents
  • Field engineers: Access equipment manuals and troubleshooting guides offline

Privacy-Critical RAG

The entire pipeline — embedding, retrieval, generation, and speech — runs locally:

  • Sensitive corporate documents never leave the device
  • No API keys, no usage logs, no third-party data processing
  • Compliant by architecture, not by policy

Offline Knowledge Access

  • Aircraft maintenance crews accessing technical manuals mid-flight
  • Emergency responders querying protocols in connectivity dead zones
  • Researchers working with classified or embargoed data

The Numbers That Matter

Retrieval Performance:

  • <4ms total retrieval latency (hybrid, top-k=5)
  • 0.015μs embedding cache hit latency (99.9% hit rate)
  • 255,000x speedup on cache hits
  • 0.001ms RRF fusion time

Pipeline Performance:

  • <200ms first-audio with full RAG (top-k=5)
  • 16% first-audio growth from k=1 to k=10 (word-level flushing)
  • 3.8x TTFT growth absorbed by streaming

System Properties:

  • 5,016 chunks indexed and searchable
  • Zero cloud dependencies
  • Zero allocations in the hot path
  • Dual Metal GPU contexts for concurrent embedding + LLM

Summary

RAG doesn't have to be slow. RAG doesn't have to be in the cloud.

FastVoice RAG proves that hybrid retrieval-augmented generation can run in a real-time voice pipeline — entirely on-device — with first-audio latency under 200ms. The retrieval itself adds almost nothing. The LLM prefill cost is real, but word-level streaming flush absorbs it.

The architecture: memory-mapped zero-copy indexes, pre-allocated fusion buffers, frequency-weighted embedding cache, and dual Metal GPU inference — all composed into a single C++ binary on Apple Silicon.

No cloud. No network. No waiting.


Evaluated on Apple M3 Max, 36GB unified memory, macOS Sonoma. Corpus: 5,016 chunks from 1,030 CS documents. Models: Qwen3 0.6B Q4_K_M (KV cache Q8_0), Snowflake Arctic Embed S Q8_0, Piper Amy medium. STT: Streaming Zipformer via sherpa-onnx. File-mode benchmarks, 5 runs. Embedding cache: frequency-weighted LRU. Retrieval: BM25 + HNSW + RRF (k=60).
