TL;DR
- Gemma 4 became the ecosystem stress test: vLLM, SGLang, llama.cpp, Ollama, and Apple MLX all shipped support or rapid follow-up fixes, turning one model launch into a full-stack compatibility event.1
- KV-cache compression moved from research topic to product roadmap: vLLM, lmdeploy, ExecuTorch, mlx-vlm, and omlx all touched quantized cache or TurboQuant-style work in the same week.6
- Apple Silicon is no longer a side quest: Ollama, mlx, mlx-lm, and ExecuTorch all treated Apple deployment as a first-class inference target rather than a downstream port.11
- Tool calling became an inference-engine feature, not just an app feature: vLLM, Ollama, omlx, and text-generation-webui all shipped parser, streaming, or MCP-related changes.15
- Edge and heterogeneous inference kept gaining ground: LiteRT-LM, MNN, OpenVINO, and ONNX Runtime all expanded hardware paths, packaging, or deployment coverage across GPU, NPU, WebGPU, and CPU.19
This Week in Inference
The market split that’s been forming for months is now obvious. On one side are larger open models aimed at long-horizon agents and multimodal product experiences, including Google DeepMind’s Gemma 4 family, Z.AI’s GLM-5.1, Meta’s Muse Spark, Microsoft’s MAI model trio, and Alibaba’s Qwen 3.6-Plus, all highlighted in this week’s market briefing through external coverage (Gemma 4 coverage, GLM-5.1 coverage, Muse Spark coverage, Qwen 3.6-Plus coverage).23 On the other side are smaller, more aggressively optimized deployment targets — phones, Macs, browsers, NPUs, Raspberry Pi-class devices — where the real competition is not benchmark glory but whether a model can run reliably, cheaply, and with acceptable latency.
That’s why Gemma 4 mattered so much this week. It wasn’t just another model release; it was a forcing function. Across serving engines and local runtimes, the first wave was architecture support, and the second wave was all the hard stuff: tokenizer edge cases, reasoning parsers, tool-call formatting, flash-attention fallbacks, quantized loading bugs, and multimodal regressions. You can see that pattern in vLLM, SGLang, llama.cpp, Ollama, mlx-lm, mlx-vlm, and Transformers.1 The common thread: inference stacks are now judged on how quickly they can absorb a new model family without breaking structured outputs, long sessions, or hardware portability.
The deeper strategic shift is memory. TurboQuant-style KV-cache compression was one of the week’s biggest market themes in research coverage (TurboQuant summary), but the more important signal is that open-source projects are already converging on the same problem from different directions.29 vLLM shipped per-token-head KV-cache quantization, ExecuTorch merged KV compression work, lmdeploy saw direct demand for TurboQuant support, and Apple-adjacent stacks like mlx-vlm and omlx pushed their own TurboQuant implementations.6 The industry takeaway is straightforward: as models become more agentic and sessions get longer, the bottleneck is increasingly cache residency, not just model weights.
Hardware news reinforced the same point. NVIDIA’s Marvell investment and Intel’s heterogeneous inference partnership with SambaNova, both surfaced in the market briefing, point toward a world where orchestration, prefill, decode, and retrieval are spread across different compute tiers rather than pinned to one monolithic accelerator (NVIDIA-Marvell coverage, Intel-SambaNova coverage).30 Open-source code is moving in the same direction: LiteRT-LM pushed NPU speculative decoding, OpenVINO expanded CPU/GPU/NPU GenAI support, MNN added MUSA and RVV, and ONNX Runtime kept broadening plugin execution paths.19 Cloud, local, and edge are no longer separate markets; they’re one stack with different latency and memory budgets.
Top Stories
vLLM turns KV-cache efficiency into a first-class serving feature
vLLM shipped one of the week’s most consequential datacenter updates: full Gemma 4 support plus per-token-head INT8 and FP8 KV-cache quantization, alongside fused low-precision output quantization in the MLA path.1 That matters because it moves vLLM beyond “fast OpenAI-compatible server” territory and deeper into the economics of long-context and agentic serving, where memory pressure often decides whether a deployment scales. The surrounding parser and tool-calling fixes also show that modern serving engines are now responsible for structured-output correctness, not just token throughput.
SGLang broadens from fast server to multi-accelerator inference platform
SGLang had the highest visible engineering velocity of the week, with Gemma 4 support, speculative decoding improvements, ASR expansion, and active work across NVIDIA, AMD, and NPU paths.2 The important shift is architectural: SGLang is increasingly positioning itself as a hardware-adaptive serving layer rather than a CUDA-only speed play. Its work on speculative decoding, quantization, and accelerator-specific fixes suggests the project is aiming for the same “reference stack” status in heterogeneous deployments that vLLM has in mainstream GPU serving.
Ollama proves local inference now lives or dies on rapid model stabilization
Ollama pushed five releases in six days, all orbiting Gemma 4 support, flash-attention behavior, parser correctness, and Apple performance.4 That cadence is the story: local runtimes are now expected to respond to new model launches with cloud-like operational speed, because users treat them as production tools, not hobby software. The project’s fixes for tool calling, tokenizer robustness, and GPU-specific flash-attention fallbacks show how much hidden serving complexity has moved into the “simple local app” layer.
Apple’s MLX stack crosses from niche ecosystem to strategic deployment target
mlx-lm, mlx, and Apple-adjacent projects like mlx-vlm and Ollama all treated Apple Silicon as a primary inference platform this week.5 Gemma 4 support, speculative decoding fixes, CUDA-side maturity in core MLX, and multimodal TurboQuant work all landed in parallel. The implication is bigger than Macs: Apple has become the proving ground for a class of local inference workloads that need high memory bandwidth, strong developer ergonomics, and consumer-grade deployment.
OpenVINO and LiteRT-LM show edge inference is becoming a real systems market
OpenVINO shipped a major release with broader GenAI model support and a preview llama.cpp backend, while LiteRT-LM kept pushing session architecture, streaming, and NPU speculative decoding.19 These are not toy edge demos anymore. The open-source edge stack is starting to look like the cloud stack did a few years ago: model onboarding, scheduler behavior, hardware abstraction, and packaging are becoming the real differentiators.
Deeper Dive
Everything below is for readers who want the full picture. Feel free to scroll.
Code Changes by Category
Cloud & Datacenter Serving
vLLM had the week’s most important datacenter-serving release.1 The headline was full Gemma 4 architecture support across MoE, multimodal, reasoning, and tool-use paths (support work), but the more durable change was memory efficiency: per-token-head INT8 and FP8 KV-cache quantization landed in core serving paths (KV quantization), and fused FP8/NVFP4 output quantization arrived in MLA attention (MLA quantization).6 Around that, vLLM invested heavily in parser and streaming correctness for Gemma and Qwen tool use, including unified Responses API parser migration and multiple structured-output fixes (parser migration, Gemma reasoning fix, streaming tool-call fix).15 The pattern is clear: serving engines are becoming protocol interpreters as much as schedulers.
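The core idea behind per-token-head cache quantization is simple: instead of one scale for the whole cache, each (token, head) slice gets its own scale, so an outlier in one head doesn’t degrade precision everywhere else. Below is a minimal NumPy sketch of that scheme as a general technique; it is an illustration, not vLLM’s fused-kernel implementation, and the function names are invented for the example.

```python
import numpy as np

def quantize_kv(kv: np.ndarray):
    """kv: [tokens, heads, head_dim] float32 -> int8 codes + per-(token,head) scales."""
    # Reduce over head_dim only, so every (token, head) slice gets its own scale.
    scales = np.abs(kv).max(axis=-1, keepdims=True) / 127.0
    scales = np.where(scales == 0, 1.0, scales)  # avoid divide-by-zero on empty slices
    codes = np.clip(np.round(kv / scales), -127, 127).astype(np.int8)
    return codes, scales.astype(np.float32)

def dequantize_kv(codes: np.ndarray, scales: np.ndarray) -> np.ndarray:
    return codes.astype(np.float32) * scales

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    kv = rng.standard_normal((16, 8, 64)).astype(np.float32)
    codes, scales = quantize_kv(kv)
    err = np.abs(dequantize_kv(codes, scales) - kv).max()
    print(f"int8 cache, max abs reconstruction error: {err:.4f}")
```

The payoff is 4x less cache memory versus FP32 (2x versus FP16), with per-slice error bounded by roughly half a quantization step of that slice’s own amplitude.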
SGLang matched or exceeded everyone on raw throughput of engineering work.2 Gemma 4 support landed early in the week (Gemma support), but the more interesting work was underneath: dynamic HTTP support for multiple speculative models (spec decode HTTP), fixes for hybrid linear attention with n-gram speculation (crash fix), and a broad accelerator push spanning NPU, AMD, and Blackwell-class GPUs (NPU cache enablement, AMD fix, Blackwell default path).37 SGLang also expanded into ASR and scoring with new transcription adapters and SequenceClassification support (ASR adapter, score API).43
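Speculative decoding, which keeps recurring in SGLang’s changelog, boils down to a draft-and-verify loop: a cheap drafter proposes several tokens, the target model checks them, and the longest agreed prefix is accepted in one step. The sketch below shows the loop shape with toy deterministic next-token functions standing in for models; real engines verify the whole draft in a single batched forward pass.

```python
def speculative_step(prefix, draft, target, k=4):
    """Propose k draft tokens, keep the longest prefix the target agrees with."""
    # Draft phase: the cheap model proposes k tokens autoregressively.
    proposed, ctx = [], list(prefix)
    for _ in range(k):
        t = draft(ctx)
        proposed.append(t)
        ctx.append(t)
    # Verify phase: the target model checks each proposal in order.
    accepted, ctx = [], list(prefix)
    for t in proposed:
        if target(ctx) == t:           # agreement: the token costs no extra target step
            accepted.append(t)
            ctx.append(t)
        else:                          # disagreement: take the target's token and stop
            accepted.append(target(ctx))
            break
    else:
        accepted.append(target(ctx))   # all k accepted: bonus token from the target
    return prefix + accepted

if __name__ == "__main__":
    target = lambda ctx: (sum(ctx) + 1) % 7   # toy deterministic "model"
    print(speculative_step([1, 2], target, target, k=4))  # perfect drafter: k+1 tokens
```

With a perfect drafter the step emits k+1 tokens per verification pass; with a bad drafter it degrades gracefully to one target token per step, which is why drafter quality (MTP heads, n-gram drafters) is where the engineering effort goes.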
lmdeploy had a quieter but strategically important week.45 The release centered on broader Qwen3.5 support, especially MTP and Ascend paths (Qwen3.5 MTP, Ascend support), plus prefix-caching fixes on Ascend (prefix cache fix).46 More telling was the community demand around KV-cache compression and TurboQuant support (TurboQuant issue, KV compression issue).7 That’s a strong signal that cache economics are now a user-facing feature request, not just backend optimization.
Triton Inference Server was comparatively quiet, with only light maintenance activity visible in the supplied data.50 ai-dynamo showed heavy org-level throughput but lacked repo-level detail in the source summaries, so it remains a “watch for next week” name rather than a reportable engineering story.51
Local LLM Runtimes
llama.cpp spent the week doing what it often does best: turning a major model launch into a brutal compatibility gauntlet and then fixing it in public.3 Gemma 4 drove parser, vocab, KV-cache, and end-of-generation fixes (parser support, KV rotation support, EOG fix), while backend work continued across Metal, CUDA, WebGPU, Vulkan, and MMQ kernels (Metal Q1_0, CUDA graph work, WebGPU iOS, Vulkan FA dequant).52 The repo’s issue traffic around Gemma 4 bad tokens and looping output made the broader point: local runtimes are now expected to absorb frontier model complexity at near-release speed (Gemma issue).59
Ollama translated that same pressure into a productized local runtime.4 Gemma 4 support landed first (initial support), then came the real work: flash-attention enablement, rollback, and selective disablement on older CUDA hardware (enable FA, revert, older GPU fallback).60 Tool-calling and parser fixes followed quickly (tool-call handling, quoted args fix), alongside tokenizer robustness improvements like byte fallback in SentencePiece BPE (byte fallback).16 The project also kept pushing Apple-specific performance with NAX and MLX-related work (M5 performance, MLX HTTP client).66
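Byte fallback, mentioned above as one of Ollama’s tokenizer robustness fixes, is a general SentencePiece-style trick: when the vocabulary has no piece for a character, emit its raw UTF-8 bytes as dedicated byte tokens instead of collapsing it to an unknown token. A toy character-level illustration (not Ollama’s actual tokenizer, and with an invented function name):

```python
def tokenize_with_byte_fallback(text: str, vocab: set) -> list:
    """Character-level toy tokenizer: known chars pass through, unknown
    chars fall back to <0xNN> byte tokens so nothing becomes <unk>."""
    tokens = []
    for ch in text:
        if ch in vocab:
            tokens.append(ch)
        else:
            # Every UTF-8 byte has a guaranteed token, so any input survives.
            tokens.extend(f"<0x{b:02X}>" for b in ch.encode("utf-8"))
    return tokens

if __name__ == "__main__":
    vocab = {"h", "i", " "}
    # 'é' is not in the vocab, so it becomes two byte tokens: <0xC3> <0xA9>
    print(tokenize_with_byte_fallback("hi é", vocab))
```

The practical effect is exactly what the fixes target: emoji, rare scripts, and malformed input degrade to longer token sequences rather than lossy `<unk>` replacements.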
LocalAI had a feature-heavy week, and the project increasingly looks like a local control plane rather than just a wrapper around backends.68 The new Kokoros backend, autocomplete-enabled model config editor, and coding-agent discoverability work all point in that direction (Kokoros, config editor, agent discoverability).69 It also tracked the Gemma 4 wave with gallery additions and tokenization fixes (Gemma gallery), while improving Anthropic compatibility and distributed-node resilience (Anthropic fix, node failure detection).73
text-generation-webui had one of the week’s most interesting UI-layer changes: MCP server support directly in chat (MCP support).75 That’s notable because it pulls orchestration and tool discovery closer to the runtime surface. The repo also refreshed llama.cpp integrations, fixed Gemma 4 prompt and tool-call edge cases, tightened security defaults around trust_remote_code, and patched restart/networking issues (llama backend refresh, tool-call truncation fix, security hardening).76
Apple Silicon & MLX Ecosystem
mlx-lm was the center of gravity in Apple’s inference stack this week.5 Gemma 4 support landed first (Gemma support), then came tool-calling support, parser fixes for multi-token think/tool markers, quantized loading fixes, speculative decoding corruption fixes, and cache/batch control restoration (tool calling, parser fix, quantized load fix, spec decode fix).13 This is exactly what a maturing runtime looks like: model onboarding followed immediately by correctness work in the hard-to-test paths.
mlx itself focused on core runtime quality.12 CUDA thread-safety improvements, quantized gather matmul support on CUDA, transformer decoder correctness fixes, and build/install reliability all landed in the same week (CUDA thread safety, quantized gather matmul, decoder fix, install fix).12 The significance is subtle but important: MLX is no longer just “Apple-only.” It is becoming a broader runtime substrate with Apple as the flagship deployment target.
mlx-vlm had one of the busiest multimodal weeks anywhere in open inference.27 Gemma 4 support expanded across vision, audio, and MoE (Gemma support), then quickly accumulated fixes for chunked prefill, vision-text degradation, processor config gaps, nested tool arguments, multi-image handling, and cache aliasing (chunked prefill, vision/text fix, tool args fix, cache aliasing).86 TurboQuant was the standout theme: introduced, documented, optimized with Metal kernels, and then race-condition patched (TurboQuant intro, optimization, race fix).9 This is one of the clearest examples of research-grade memory ideas moving into user-facing local tooling.
omlx had a similarly intense week on Apple-centric serving.93 Gemma 4 native tool calling landed through updated MLX dependencies (native tool calling), TurboQuant returned and expanded to more bit-widths (TurboQuant expansion), and the audio stack gained zero-shot voice-cloning primitives through ref_audio and ref_text request support (audio request, speech endpoint).10 The project also simplified concurrency tuning with a single max_concurrent_requests knob (concurrency simplification), which is exactly the kind of operational polish that turns a local runtime into a deployable service.96
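A single concurrency knob like omlx’s max_concurrent_requests typically comes down to a semaphore gating the request handlers. The sketch below shows the generic pattern with asyncio; the knob name comes from the release notes, but the Server class and its internals here are invented for illustration.

```python
import asyncio

class Server:
    def __init__(self, max_concurrent_requests: int):
        # One semaphore is the whole policy: excess requests queue on acquire.
        self._sem = asyncio.Semaphore(max_concurrent_requests)
        self._active = 0
        self.peak = 0  # observed peak concurrency, for the demo below

    async def handle(self, request_id: int) -> str:
        async with self._sem:
            self._active += 1
            self.peak = max(self.peak, self._active)
            await asyncio.sleep(0.01)  # stand-in for model inference
            self._active -= 1
            return f"done-{request_id}"

async def main():
    server = Server(max_concurrent_requests=2)
    results = await asyncio.gather(*(server.handle(i) for i in range(6)))
    print(results, "peak concurrency:", server.peak)

if __name__ == "__main__":
    asyncio.run(main())
```

The operational appeal is that one number bounds both memory pressure and tail latency, instead of asking users to tune separate batch, queue, and worker settings.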
Mobile & Edge Frameworks
LiteRT-LM had the most substantial pure edge-runtime week.19 The project refactored loader and session architecture for streaming-aware operation (loader refactor, session refactor), added async messaging plus checkpoint and rewind APIs (async send, rewind API), and pushed NPU speculative decoding through an MTP drafter pipeline (NPU speculation).97 That’s a serious systems story, not just a model-support update.
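Checkpoint and rewind APIs are cheap precisely because a decode-time KV cache is append-only: a checkpoint can be nothing more than a saved length, and rewinding truncates back to it without recomputing prefill. LiteRT-LM’s actual API surface differs; this is a minimal sketch of the underlying idea with invented names.

```python
class Session:
    def __init__(self):
        self.tokens = []        # stands in for the KV cache's token history
        self._checkpoints = {}

    def append(self, tokens):
        self.tokens.extend(tokens)

    def checkpoint(self, name: str):
        # O(1): an append-only cache is fully described by its length.
        self._checkpoints[name] = len(self.tokens)

    def rewind(self, name: str):
        # Truncate back to the saved length; the retained prefix stays warm.
        self.tokens = self.tokens[: self._checkpoints[name]]

if __name__ == "__main__":
    s = Session()
    s.append([1, 2, 3])
    s.checkpoint("before-tool-call")
    s.append([9, 9, 9])            # a speculative or tool-call turn
    s.rewind("before-tool-call")   # discard it without re-running prefill
    print(s.tokens)                # [1, 2, 3]
```

That is why rewind matters for agent loops on-device: abandoning a failed tool-call branch costs a truncation, not a fresh prefill of the whole conversation.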
MNN shipped one of the strongest edge releases of the week.20 New MUSA support extended the runtime into another accelerator ecosystem (MUSA support), RVV support broadened CPU portability (RVV support), and text-level prompt caching improved multi-turn on-device chat (prompt cache).102 Linear attention optimizations on OpenCL and Metal rounded out the release (OpenCL optimization, Metal optimization).105
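Text-level prompt caching of the kind MNN added for multi-turn chat amounts to storing computed prefixes and, on the next turn, reusing the longest cached prefix so only the new suffix needs prefill. A toy dict-backed sketch of that lookup (not MNN’s design; the class and method names are invented):

```python
class PromptCache:
    def __init__(self):
        self._cache = {}  # prompt prefix -> opaque "KV state" placeholder

    def store(self, prompt: str):
        self._cache[prompt] = f"state({len(prompt)} chars)"

    def lookup(self, prompt: str):
        """Return (reused_prefix, suffix_to_prefill) for a new prompt."""
        best = ""
        for prefix in self._cache:
            if prompt.startswith(prefix) and len(prefix) > len(best):
                best = prefix  # longest cached prefix wins
        return best, prompt[len(best):]

if __name__ == "__main__":
    cache = PromptCache()
    cache.store("system: be helpful\nuser: hi\n")
    reused, todo = cache.lookup("system: be helpful\nuser: hi\nassistant: hello\n")
    print(f"reused {len(reused)} chars, prefill {len(todo)} chars")
```

In a multi-turn chat the system prompt plus prior turns dominate the prompt length, so this turns per-turn prefill cost from O(conversation) into roughly O(new turn).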
ExecuTorch had a backend-expansion week that deserves more attention than it got.107 The new MLX delegate series is the headline (delegate part 1, part 2, part 3), but the broader story is that ExecuTorch is trying to be the bridge across Apple, Qualcomm, Vulkan, Arm, Cadence, and Cortex-M in one runtime.14 KV-cache compression work for Qwen-class models also landed in the same week (KV compression), reinforcing the memory theme.8
Compilers, Runtimes & Graph Engines
ONNX Runtime had a broad, infrastructure-heavy week.110 Model package support was the biggest single feature (model packaging), but the more strategic work was around plugin execution providers: CUDA plugin EP improvements, WebGPU plugin EP work, and Vulkan interop support (CUDA plugin work, WebGPU plugin, Vulkan interop).22 Operator coverage also moved toward modern LLM workloads with LinearAttention and related ops (LinearAttention).113 This is ONNX Runtime leaning into deployment breadth rather than chasing a single accelerator story.
OpenVINO shipped a major platform release with expanded GenAI support, including Qwen3-VL on CPU/GPU, GPT-OSS on CPU, and a preview llama.cpp backend.21 Under the hood, the week included a global zero-memory pool change for NPU runtime, unified Level Zero loading across GPU and NPU, grouped GEMM integration, and multiple Qwen GPU fixes (NPU pool change, Level Zero unification, grouped GEMM, Qwen3-VL fix).114 OpenVINO increasingly looks like the “heterogeneous enterprise inference” answer to the Apple- and CUDA-centric stacks.
XNNPACK kept doing the low-level work that quietly shapes the rest of the ecosystem.118 WASM SIMD128 dequantize_dot, int8 dot kernels, RVV and FP16 enablement, and an RMSNorm benchmark all landed this week (WASM dequantize, int8 dot, RVV enablement, RMSNorm benchmark).121 If edge and browser inference keep growing, XNNPACK’s portability work will matter more than many higher-profile model releases.
Models, Quantization & Optimization
Transformers had the week’s canonical model-library release, with Gemma 4 as the headline addition.28 But the more interesting work came after the release: MoE tensor-parallel support, export and test fixes, and a large vLLM CI compatibility sweep (MoE TP, export fix, vLLM CI).123 Continuous batching also gained per-request logits processors (continuous batching).126 The message is that model libraries are now part of serving infrastructure, not just training and fine-tuning.
Diffusers expanded model and pipeline coverage with FLUX.2 small decoder and NucleusMoE-Image support (FLUX.2 support, NucleusMoE-Image), while also investing in profiling and quantization/offloading compatibility (profiling guide, offloading fix).128 Even though this newsletter centers inference, not training, Diffusers matters because image and video inference are increasingly sharing the same deployment concerns: memory movement, offload strategy, and pipeline-level profiling.
ROCm/aiter deserves mention as one of the week’s most intense kernel-level efforts.132 Triton GEMM retuning, fused low-precision MoE kernels, error bridging to prevent worker crashes, and relaxed KV overflow behavior all landed in rapid succession (GEMM retune, fused quant MoE, error bridging, KV overflow behavior).132 This is the kind of shared infrastructure that increasingly determines whether AMD inference stacks feel production-ready.
Other Notable Changes
gallery continued to evolve as a consumer-facing showcase for on-device inference, with AICore integration and LiteRT-LM updates (AICore integration, LiteRT-LM bump).136 It’s worth watching because user issues there often surface edge-runtime problems before they’re visible in lower-level repos.
coremltools had no code movement but did surface potentially important correctness issues around InstanceNorm3d on CPU and GPU paths (GPU crash issue, CPU semantic difference).139 In a week where Apple deployment mattered more than usual, that’s a reminder that conversion and runtime correctness still lag behind model support headlines.
Community Pulse
Gemma 4 dominated issue trackers almost everywhere it landed. In llama.cpp, the main Gemma 4 eval bug thread became a focal point for parser and tokenization fixes.59 In Ollama, users quickly surfaced GPU-vs-CPU execution confusion, hangs, and older CUDA incompatibilities.141 In vLLM, the pressure was on reasoning parsers and tool validation.142 In mlx-lm and mlx-vlm, quantized loading and multimodal regressions appeared almost immediately.143 The pattern is healthy, if messy: the ecosystem is now fast enough that model launches trigger same-week stabilization loops across every layer.
TurboQuant and KV compression generated the week’s most forward-looking community energy. mlx-vlm saw especially strong enthusiasm around TurboQuant, while lmdeploy and OpenVINO-adjacent summaries showed direct user demand for cache compression and longer-context efficiency.9 That’s notable because users are no longer just asking “does this model run?” They’re asking whether it can stay resident, stay cheap, and stay responsive over long sessions.
Edge deployment friction remained a recurring theme. TensorRT-Edge-LLM users asked about LoRA compilation failures and x86 portability, LiteRT-LM users reported Windows WebGPU crashes, gallery users hit engine-creation failures, and RCLI users asked how to run unsupported local models.146 The market wants edge inference, but the installability and portability tax is still real.
Worth Watching
The biggest cross-repo pattern to watch is the convergence of tool calling, reasoning parsers, and serving APIs. vLLM, Ollama, omlx, and text-generation-webui all moved in this direction at once.15 Expect next quarter’s competition between inference engines to be partly about who handles structured outputs and agent loops most reliably under streaming.
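The reason streaming makes structured outputs hard is mechanical: tool-call arguments arrive as partial JSON spread across chunks, and the engine must decide when the object is complete before it can dispatch the call. The sketch below uses naive brace counting to show the shape of the problem; real parsers in the engines above must also handle braces inside strings, multiple calls, and malformed output. Class and method names here are invented for the example.

```python
import json

class ToolCallStream:
    """Accumulate streamed chunks until a complete JSON tool call arrives."""

    def __init__(self):
        self.buf = ""

    def feed(self, chunk: str):
        """Return the parsed tool call once the object closes, else None."""
        self.buf += chunk
        opens, closes = self.buf.count("{"), self.buf.count("}")
        # Naive completeness check: balanced braces (ignores braces in strings).
        if opens > 0 and opens == closes:
            return json.loads(self.buf)
        return None

if __name__ == "__main__":
    stream = ToolCallStream()
    call = None
    for chunk in ['{"name": "get_weather", ', '"arguments": {"city": ', '"Paris"}}']:
        call = stream.feed(chunk)
    print(call)  # parsed only after the final chunk closes the object
```

Engines that get this wrong emit truncated arguments or fire tools early, which is exactly the class of bug the week’s parser fixes in vLLM, Ollama, and text-generation-webui were chasing.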
Second, KV-cache compression is moving from “interesting optimization” to “required roadmap item.” The combination of vLLM, ExecuTorch, lmdeploy, mlx-vlm, and omlx suggests that by summer, every serious runtime will need a story for compressed or quantized cache paths.6
Third, Apple is becoming a strategic proving ground for local inference. Between MLX, Ollama, ExecuTorch, and omlx, the ecosystem is treating Apple deployment as a primary target for both consumer and developer workflows.11 That likely means more competition around Metal kernels, memory reuse, and multimodal local UX.
Major Releases
vLLM shipped a headline release in v0.19.0, centered on full Gemma 4 support, per-token-head INT8/FP8 KV-cache quantization, and a long list of platform and kernel improvements across ROCm, XPU, CPU, and low-precision paths. This was the week’s most important cloud-serving release because it combined model onboarding with real memory-efficiency gains. 1
SGLang released v0.5.10, with the official notes emphasizing piecewise CUDA graph defaults and elastic expert parallelism for partial-failure tolerance. In practice, the broader release window was about Gemma 4 support, speculative decoding maturity, and wider accelerator coverage across NVIDIA, AMD, and NPU deployments. 2
Ollama shipped five releases from v0.20.0 through v0.20.4, all focused on rapid-response Gemma 4 support, flash-attention stabilization, parser and tool-calling fixes, and Apple Silicon performance tuning. The cadence reflects a tight ship-observe-patch loop around a major local-model launch (Latest release).11
ggml / llama.cpp published a dense stream of rolling releases through the week, reflecting nonstop stabilization around Gemma 4, backend support, and server/runtime fixes. The dominant theme was compatibility hardening across Metal, CUDA, WebGPU, Vulkan, and parser behavior rather than one monolithic feature drop (All releases).3
LocalAI shipped four releases from v4.1.0 through v4.1.3, combining a feature push — including the Kokoros backend and stronger web configuration UX — with rapid follow-up fixes for Gemma 4 tokenization, Anthropic compatibility, and login/API regressions. The most important takeaway is that LocalAI is increasingly acting like a local inference platform, not just a backend wrapper (Latest release).68
Apple MLX / mlx-lm shipped v0.31.2, a release anchored by system and user message caching for non-trimmable caches, batch generator refactoring, and the surrounding Gemma 4 support wave. The week’s broader MLX story was simultaneous expansion in model support, parser correctness, and runtime maturity. 5
Hugging Face Transformers released v5.5.0, with Gemma 4 as the headline addition and a follow-on stabilization cycle around MoE tensor parallelism, export correctness, and serving compatibility. The release mattered less as a library milestone than as a forcing function for downstream inference stacks. 28
OpenVINO shipped 2026.1.0, expanding GenAI model coverage across CPU, GPU, and NPU, including Qwen3-VL and a preview llama.cpp backend. The dominant theme was heterogeneous deployment: more model breadth, more backend unification, and more enterprise-ready runtime plumbing. 21
LiteRT-LM released v0.10.1, packaging Gemma 4 support alongside a broader push on session architecture, streaming, and hardware deployment paths. The most impactful change was not just model support but the continued evolution of LiteRT-LM into a serious on-device runtime. 19
Google AI Edge gallery shipped 1.0.11, focused on Gemma 4 offline models plus Agent Skills and community skill loading. The release reinforces gallery’s role as the user-facing front end for Google’s on-device inference stack. 136
MNN released 3.5.0, a substantial edge-inference update centered on broader backend coverage, prompt caching for multi-turn chat, and LLM-oriented optimization across OpenCL, Metal, MUSA, and RVV. This was one of the week’s strongest mobile and edge releases. 20
lmdeploy shipped v0.12.3, combining new Qwen3.5 support, Ascend backend improvements, and a cluster of inference/runtime fixes. The release’s most important signal was continued investment in non-CUDA deployment targets and live-serving mutability. 45
text-generation-webui shipped a rapid release sequence culminating in v4.4, with MCP server support in the UI as the standout addition. The week’s releases also bundled Gemma 4 support, backend refreshes, and a series of security and stability fixes (Latest release).75
mlx-vlm shipped v0.4.3 and v0.4.4, with the latter focused on Gemma 4 stabilization, chunked prefill, multimodal fixes, VisionFeatureCache, and TurboQuant optimization. The dominant theme was fast multimodal expansion followed by equally fast hardening (Latest release).27
omlx shipped four releases from v0.3.1 through v0.3.5.dev1, all centered on making Gemma 4 practical in production on Apple-centric local serving stacks. Native tool calling, TurboQuant restoration, audio/TTS improvements, and concurrency simplification made this one of the week’s most ambitious local-serving release trains (Latest release).93
Argmax shipped argmax-sdk-swift-playground 2.0.9, packaging a major refresh around real-time transcription with speaker awareness. It was a small release in ecosystem terms, but a meaningful signal that polished on-device speech UX is becoming part of the broader inference stack. 150
