Backed by Y Combinator

Inference Engines

Live

MetalRT by RunAnywhere

The fastest AI inference engine for Apple Silicon. Every GPU kernel hand-written in Metal Shading Language — LLM, speech, vision, and speech-to-speech in one C++ runtime.

metalrt benchmark: LLM decode · Qwen3-0.6B · M4 Max (tok/s)
Ollama · 85
Apple MLX · 220
llama.cpp · 290
MetalRT · 658

$0

marginal inference cost on-device

Cloud inference costs $0.08–0.35 per minute for voice alone. Serving AI to 8 billion people through centralized GPU clusters is economically impossible.

<7ms

time-to-first-token (Qwen3-0.6B, M4 Max)

A round-trip to the cloud takes 300–400ms minimum. For real-time voice, vision, and autonomous systems, physics sets the floor — on-device removes it.
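To make the latency comparison concrete, here is a minimal sketch that turns a time-to-first-token figure and a decode rate into a response-time estimate. The function names are ours, and the model is deliberately simple: it assumes the first token arrives at the TTFT and every later token takes 1000 / (tok/s) milliseconds, ignoring prompt length and any other overhead.

```cpp
#include <cassert>

// Estimated time until the n-th generated token is available, in milliseconds.
// Simplifying assumption: token 1 arrives at ttft_ms, and each later token
// takes 1000 / decode_tok_per_s milliseconds.
double time_to_nth_token_ms(double ttft_ms, double decode_tok_per_s, int n) {
    assert(n >= 1 && decode_tok_per_s > 0.0);
    return ttft_ms + (n - 1) * 1000.0 / decode_tok_per_s;
}

// Number of tokens that can be generated within a latency budget,
// under the same simplified model.
int tokens_within_ms(double ttft_ms, double decode_tok_per_s, double budget_ms) {
    assert(decode_tok_per_s > 0.0);
    if (budget_ms < ttft_ms) return 0;  // not even the first token fits
    return 1 + static_cast<int>((budget_ms - ttft_ms) * decode_tok_per_s / 1000.0);
}
```

Plugging in the figures quoted on this page (6.6 ms TTFT, 658 tok/s), `time_to_nth_token_ms(6.6, 658.0, 50)` comes to roughly 81 ms, and `tokens_within_ms(6.6, 658.0, 300.0)` gives 194 tokens inside the 300 ms that a cloud round-trip alone would take.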

658

tok/s on a single MacBook

Small models now match the quality of models 250x their size. The bottleneck isn’t the model — it’s the runtime. That’s what we build.

Benchmarks · Apple M4 Max

LLM Decode (higher is better)
RunAnywhere · 658 tok/s
Apple MLX · 553 tok/s
llama.cpp · 394 tok/s

Time to First Token (lower is better)
RunAnywhere · 6.6ms
Apple MLX · 8ms
llama.cpp · 11ms

Speech-to-Text (lower is better)
RunAnywhere · 101ms
Apple MLX · 465ms

Speech-to-Speech (higher is better)
RunAnywhere · 123 tok/s
mlx-audio · 81 tok/s
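The figures above mix "higher is better" and "lower is better" metrics, so the ratios run in opposite directions. A small sketch of how the implied speedups can be computed (the helper names are ours, not part of any SDK):

```cpp
#include <cassert>

// Throughput-style metrics (tok/s): a ratio above 1 means the candidate is faster.
double speedup_higher_is_better(double candidate, double baseline) {
    assert(candidate > 0.0 && baseline > 0.0);
    return candidate / baseline;
}

// Latency-style metrics (ms): the ratio is inverted so that a value above 1
// still means the candidate is faster.
double speedup_lower_is_better(double candidate, double baseline) {
    assert(candidate > 0.0 && baseline > 0.0);
    return baseline / candidate;
}
```

With the numbers listed, decode works out to about 1.19x over Apple MLX (658 vs 553 tok/s) and about 1.67x over llama.cpp (658 vs 394 tok/s), while speech-to-text works out to about 4.6x over Apple MLX (101 ms vs 465 ms).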

How it works

Built from the metal up.

We write GPU kernels from scratch: hand-designed memory layouts, fused operators, and custom Metal shaders that bypass generic abstraction layers. On an M4 Max, MetalRT decodes Qwen3-0.6B at 658 tok/s. Every kernel targets the specific hardware it runs on.
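As an illustration of what "fused operators" means in general, here is a CPU-side sketch in C++ of the SwiGLU activation written two ways. This is not MetalRT's kernel code; it only shows the idea that fusing removes an intermediate buffer and one full pass over memory. The function names are ours.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// SiLU activation: x * sigmoid(x).
inline float silu(float x) { return x / (1.0f + std::exp(-x)); }

// Unfused: two passes over the data plus a temporary buffer.
std::vector<float> swiglu_unfused(const std::vector<float>& gate,
                                  const std::vector<float>& up) {
    assert(gate.size() == up.size());
    std::vector<float> activated(gate.size());
    for (std::size_t i = 0; i < gate.size(); ++i) activated[i] = silu(gate[i]);
    std::vector<float> out(gate.size());
    for (std::size_t i = 0; i < gate.size(); ++i) out[i] = activated[i] * up[i];
    return out;
}

// Fused: activation and multiply happen in one pass, with no intermediate
// buffer written to or read back from memory.
std::vector<float> swiglu_fused(const std::vector<float>& gate,
                                const std::vector<float>& up) {
    assert(gate.size() == up.size());
    std::vector<float> out(gate.size());
    for (std::size_t i = 0; i < gate.size(); ++i) out[i] = silu(gate[i]) * up[i];
    return out;
}
```

Both versions compute the same values. On a GPU the fused form also saves a kernel launch, which matters most when each decode step does little arithmetic per byte read.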

The runtime orchestrates quantized weights, KV cache, and graph execution on unified memory — one C++ engine behind every modality the SDK exposes.
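To show what a KV cache does during decode, here is a minimal single-head sketch in C++. The `KVCache` struct and `attention_decode` function are hypothetical names for illustration and say nothing about MetalRT's actual layout; the point is that each new token appends one key and one value, and attention for the next token reads the whole cache instead of recomputing it.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical single-head KV cache: one row of head_dim floats per past token.
struct KVCache {
    std::size_t head_dim;
    std::vector<float> keys;    // [num_tokens * head_dim], row-major
    std::vector<float> values;  // [num_tokens * head_dim], row-major

    explicit KVCache(std::size_t dim) : head_dim(dim) { assert(dim > 0); }

    std::size_t size() const { return keys.size() / head_dim; }

    // Called once per generated token with that token's key and value.
    void append(const std::vector<float>& k, const std::vector<float>& v) {
        assert(k.size() == head_dim && v.size() == head_dim);
        keys.insert(keys.end(), k.begin(), k.end());
        values.insert(values.end(), v.begin(), v.end());
    }
};

// One decode step of scaled dot-product attention for a single query vector.
std::vector<float> attention_decode(const KVCache& cache, const std::vector<float>& q) {
    assert(q.size() == cache.head_dim && cache.size() > 0);
    const std::size_t n = cache.size();
    const std::size_t d = cache.head_dim;
    const float scale = 1.0f / std::sqrt(static_cast<float>(d));

    // Scores: (q . k_t) / sqrt(d) for every cached token t.
    std::vector<float> scores(n);
    for (std::size_t t = 0; t < n; ++t) {
        float dot = 0.0f;
        for (std::size_t i = 0; i < d; ++i) dot += q[i] * cache.keys[t * d + i];
        scores[t] = dot * scale;
    }

    // Softmax, subtracting the maximum for numerical stability.
    const float max_score = *std::max_element(scores.begin(), scores.end());
    float denom = 0.0f;
    for (std::size_t t = 0; t < n; ++t) {
        scores[t] = std::exp(scores[t] - max_score);
        denom += scores[t];
    }

    // Output: softmax-weighted sum of the cached values.
    std::vector<float> out(d, 0.0f);
    for (std::size_t t = 0; t < n; ++t) {
        const float w = scores[t] / denom;
        for (std::size_t i = 0; i < d; ++i) out[i] += w * cache.values[t * d + i];
    }
    return out;
}
```

A real runtime would keep one cache per layer and per head, preallocate it, and likely store it in a reduced-precision format; those choices are left out here.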

Inference Stack

Your App

iOS · macOS · Android

RunAnywhere.load("llama-3.2-1b")

SDK Layer

Swift · Kotlin · React Native · Flutter

Cross-platform bindings → C++ core

MetalRT Runtime

C++ Inference Engine · Quantized Weights · KV Cache

Orchestrates graph execution on unified memory

Custom .metal Kernels

Hand-written Metal Shading Language

We write every GPU kernel from scratch

qmv.metal · attention_decode.metal · rms_norm.metal · rope.metal · swiglu.metal · kv_cache.metal
Apple Silicon GPU

M1 · M2 · M3 · M4 · Unified Memory · 800 GB/s

simd_sum · threadgroup_barrier · [[buffer(0)]]

Output: 658 tok/s decode | 101ms STT

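The stack above lists quantized weights and a kernel file named qmv.metal, which presumably stands for quantized matrix-vector multiply. As a rough CPU reference of that operation, here is a C++ sketch. The 4-bit layout (two codes per byte, one scale per group, code c decoding to (c - 8) * scale) is our assumption for illustration, not MetalRT's documented format.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical 4-bit weight layout (an assumption for this sketch):
// each byte holds two unsigned 4-bit codes, low nibble first; every
// group_size consecutive weights in a row share one float scale.
struct QuantizedMatrix {
    std::size_t rows, cols, group_size;
    std::vector<std::uint8_t> packed;  // rows * cols / 2 bytes, row-major
    std::vector<float> scales;         // rows * (cols / group_size) entries
};

// y = W * x, dequantizing on the fly so full-precision weights are never
// materialized in memory.
std::vector<float> qmv(const QuantizedMatrix& w, const std::vector<float>& x) {
    assert(x.size() == w.cols);
    assert(w.group_size > 0 && w.group_size % 2 == 0 && w.cols % w.group_size == 0);
    assert(w.packed.size() == w.rows * w.cols / 2);
    const std::size_t groups_per_row = w.cols / w.group_size;
    assert(w.scales.size() == w.rows * groups_per_row);

    std::vector<float> y(w.rows, 0.0f);
    for (std::size_t r = 0; r < w.rows; ++r) {
        for (std::size_t g = 0; g < groups_per_row; ++g) {
            // Accumulate with the raw integer codes, then apply the group's
            // scale once instead of once per weight.
            float acc = 0.0f;
            for (std::size_t j = 0; j < w.group_size; ++j) {
                const std::size_t col = g * w.group_size + j;
                const std::size_t idx = r * w.cols + col;
                const std::uint8_t byte = w.packed[idx / 2];
                const int code = (idx % 2 == 0) ? (byte & 0x0F) : (byte >> 4);
                acc += static_cast<float>(code - 8) * x[col];
            }
            y[r] += acc * w.scales[r * groups_per_row + g];
        }
    }
    return y;
}
```

Decode is typically limited by how fast weights can be read from memory, so reading half a byte per weight instead of two or four bytes is where most of the benefit of quantization comes from. A GPU version would split each row across threads and combine partial sums with a reduction such as `simd_sum`.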
Coming soon

Next: HexagonRT — our inference engine for Qualcomm NPUs

The same kernel-level discipline as MetalRT, built for the NPU inside billions of Android and Windows devices. Benchmarks first, launch second.

Get launch updates via Inference Radar

RunAnywhere Labs

We build the engines, SDKs, and agents that put inference where latency, cost, and privacy want it — on-prem, cloud, edge, or in between.

© 2026 RunAnywhere, Inc.