Welcome Gemma 4: Frontier multimodal intelligence on device
Summary
Gemma 4 is a new family of open-weight multimodal models released under the Apache 2.0 license, designed for both on-device and large-scale deployment. The series accepts text, image, audio, and video inputs, with architectural features aimed at efficient long-context processing and high-performance inference.
Key Points
- The model family includes four primary variants: Gemma 4 E2B (2.3B effective/5.1B with embeddings), Gemma 4 E4B (4.5B effective/8B with embeddings), Gemma 4 31B (dense), and Gemma 4 26B A4B (Mixture-of-Experts with 4B active parameters).
- Context windows scale from 128k tokens for the E2B and E4B models to 256k tokens for the 31B and 26B A4B models.
- Audio input is supported in the smaller E2B and E4B variants.
- The 31B dense model achieved an estimated LMArena text-only score of 1452, while the 26B MoE model reached 1441.
- The vision encoder supports variable aspect ratios and configurable token budgets, allowing each image to consume between 70 and 1120 tokens.
- The model family supports tasks including OCR, speech-to-text, object detection, GUI element detection (via JSON output), and multimodal function calling; a usage sketch follows this list.
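To make these capabilities concrete, the following is a minimal sketch of a multimodal request (an image in, JSON-formatted detections out), assuming Gemma 4 follows the chat-template and processor pattern of earlier Gemma releases in Transformers. The checkpoint id `google/gemma-4-e4b-it`, the image URL, and the prompt are illustrative assumptions, not released artifacts.

```python
# Hypothetical usage sketch: ask the model to detect GUI elements and return JSON.
# Assumes a Transformers integration analogous to earlier Gemma releases; the
# checkpoint id below is an assumption, not a published model name.
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "google/gemma-4-e4b-it"  # hypothetical checkpoint id
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "url": "https://example.com/screenshot.png"},
        {"type": "text", "text": "Detect every clickable GUI element and return "
                                 "its label and bounding box as a JSON list."},
    ],
}]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device)

with torch.inference_mode():
    generated = model.generate(**inputs, max_new_tokens=256)

# Strip the prompt tokens and decode only the newly generated answer.
answer = processor.decode(
    generated[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True
)
print(answer)
```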
Technical Details
Gemma 4 alternates local sliding-window attention layers (a 512-token window in the smaller models, 1024 in the larger ones) with global full-context attention layers. To support extended context, the model employs dual RoPE configurations: standard RoPE for the sliding-window layers and pruned RoPE for the global layers. A notable feature is Per-Layer Embeddings (PLE), which introduces a parallel, low-dimensional conditioning pathway alongside the main residual stream. PLE provides a dedicated vector for every decoder layer by combining token-identity and context-aware components, allowing layer-specific specialization at minimal parameter cost.
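To make the PLE idea more tangible, here is a toy PyTorch sketch of a per-layer conditioning module that combines a token-identity embedding with a context-aware projection into one small vector per decoder layer. The dimensions, the additive combination rule, and the class name are illustrative assumptions; the released architecture may differ.

```python
# Toy sketch of Per-Layer Embeddings (PLE): every decoder layer gets a small
# conditioning vector built from a token-identity part plus a context-aware part.
# Sizes and the additive combination are assumptions made for illustration.
import torch
import torch.nn as nn

class PerLayerEmbeddings(nn.Module):
    def __init__(self, vocab_size: int, n_layers: int, ple_dim: int, hidden_dim: int):
        super().__init__()
        # Token-identity component: one low-dimensional slot per decoder layer.
        self.token_table = nn.Embedding(vocab_size, n_layers * ple_dim)
        # Context-aware component: project the running hidden state into the same space.
        self.context_proj = nn.Linear(hidden_dim, n_layers * ple_dim, bias=False)
        self.n_layers, self.ple_dim = n_layers, ple_dim

    def forward(self, token_ids: torch.Tensor, hidden: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq); hidden: (batch, seq, hidden_dim)
        identity = self.token_table(token_ids)      # (batch, seq, n_layers * ple_dim)
        context = self.context_proj(hidden)         # (batch, seq, n_layers * ple_dim)
        ple = (identity + context).view(*token_ids.shape, self.n_layers, self.ple_dim)
        return ple  # layer i reads its slice: ple[..., i, :]
```

Because the per-layer vectors are low-dimensional, the extra table adds far fewer parameters than widening the residual stream would, which is what keeps the cost of layer-specific specialization small.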
Inference efficiency is optimized through a Shared KV Cache, where the final N layers of the model reuse key-value tensors from the last non-shared layer of the same attention type, reducing both memory footprint and compute requirements. The vision encoder uses learned 2D positions and multidimensional RoPE to preserve original aspect ratios. For audio processing, the E2B and E4B models use a USM-style Conformer architecture. The models are designed for broad compatibility across deployment stacks, including Transformers, llama.cpp, MLX, WebGPU, and Rust.
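The Shared KV Cache can be viewed as a layer-to-layer mapping: each of the final shared layers reads the key-value tensors owned by the last non-shared layer with the same attention type (local sliding-window vs. global). The sketch below builds such a mapping; the schedule, function name, and 3:1 local-to-global pattern are assumptions for illustration, not the released configuration.

```python
# Illustrative sketch of Shared KV Cache routing: the last `n_shared` layers do not
# compute their own key/value projections and instead reuse the KV tensors of the
# last non-shared layer of the same attention type. The layout below is an assumption.
from typing import Dict, List

def build_kv_sources(layer_types: List[str], n_shared: int) -> Dict[int, int]:
    """Map each layer index to the layer whose KV cache it reads."""
    first_shared = len(layer_types) - n_shared
    last_owner: Dict[str, int] = {}   # last non-shared layer seen, per attention type
    sources: Dict[int, int] = {}
    for i, ltype in enumerate(layer_types):
        if i < first_shared:
            last_owner[ltype] = i
            sources[i] = i                   # owns and computes its own KV tensors
        else:
            sources[i] = last_owner[ltype]   # shared: reuse KV from the matching owner
    return sources

# Example: 12 layers in a 3 local : 1 global pattern, with the final 4 layers shared.
layer_types = ["local", "local", "local", "global"] * 3
print(build_kv_sources(layer_types, n_shared=4))
# -> layers 8-10 (local) reuse layer 6's KV; layer 11 (global) reuses layer 7's KV.
```

Skipping the key/value projections in the shared layers is what saves compute in addition to the cache memory itself.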
Impact / Why It Matters
The combination of Apache 2.0 licensing and efficiency-oriented design choices such as MoE and the Shared KV Cache lets developers deploy sophisticated multimodal capabilities, including object detection and video understanding, directly on edge devices and local hardware.