A Dream of Spring for Open-Weight LLMs: 10 Architectures from Jan-Feb 2026
Summary
The first two months of 2026 saw a surge of advanced open-weight LLM releases, characterized by highly efficient Mixture-of-Experts (MoE) configurations and multimodal integration. These releases focus on improving inference throughput, long-context handling, and specialized task performance through techniques such as Multi-Token Prediction and sliding window attention.
Key Points
- Arcee AI Trinity Large: A 400B parameter MoE model with 13B active parameters, featuring a 3:1 local-to-global sliding window attention (SWA) ratio and a 4096 token window.
- Moonshot AI Kimi K2.5: A 1-trillion-parameter multimodal model that utilizes an early fusion approach, integrating vision tokens directly into the pre-training process with 15 trillion mixed tokens.
- StepFun Step 3.5 Flash: A 196B MoE model (11B active) capable of 100 tokens/sec at 128k context length, significantly outperforming DeepSeek V3.2 in throughput.
- Qwen3-Coder-Next: An 80B model with only 3B active parameters that achieves SWE-Bench Pro performance levels comparable to Claude Sonnet 4.5.
- Architectural Innovations: Recent models are increasingly adopting QK-Norm for training stability, NoPE (No Positional Embeddings) in global attention layers, and depth-scaled RMSNorm initialization (see the sketch after this list).
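To make the hybrid attention layout concrete, here is a minimal sketch of a 3:1 local-to-global layer schedule and the corresponding attention masks, assuming a PyTorch-style implementation; the function names and the interleaving rule are illustrative assumptions, not taken from any of these models' released code. In such a layout, the local layers typically carry RoPE, while the global layers omit positional embeddings entirely (NoPE).

```python
import torch

def layer_schedule(num_layers: int, local_to_global: int = 3) -> list[str]:
    """Interleave local (sliding-window + RoPE) and global (full-attention, NoPE)
    layers: with local_to_global=3 the pattern is L L L G L L L G ..."""
    return ["global" if (i + 1) % (local_to_global + 1) == 0 else "local"
            for i in range(num_layers)]

def attention_mask(seq_len: int, kind: str, window: int = 4096) -> torch.Tensor:
    """Causal mask; local layers additionally restrict each query to the
    `window` most recent tokens (sliding window attention)."""
    idx = torch.arange(seq_len)
    mask = idx[None, :] <= idx[:, None]                  # causal: key <= query
    if kind == "local":
        mask &= (idx[:, None] - idx[None, :]) < window   # stay inside the window
    return mask

print(layer_schedule(8))   # ['local', 'local', 'local', 'global', 'local', ...]
print(attention_mask(6, "local", window=3).int())
```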
Technical Details
Recent architectures demonstrate a shift toward optimizing the trade-off between parameter count and inference efficiency. Arcee AI’s Trinity series uses an implementation of SWA in which each token attends to a fixed-size window of the 4096 most recent tokens, reducing attention complexity from $O(n^2)$ to $O(n \cdot w)$, where $w$ is the window size. To stabilize training at scale, the Trinity architecture employs QK-Norm (applying RMSNorm to queries and keys) and a depth-scaled RMSNorm placement, in which the gain of the second RMSNorm in each block is initialized to $1/\sqrt{L}$, where $L$ is the total number of layers.
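A minimal sketch of how QK-Norm and the depth-scaled RMSNorm initialization could be implemented, assuming PyTorch ≥ 2.4 for `nn.RMSNorm`; the module layout (a `blocks` list with a `post_norm` attribute) is a hypothetical structure for illustration, not Trinity’s actual code.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class QKNormAttention(nn.Module):
    """Self-attention with RMSNorm applied per head to queries and keys (QK-Norm)."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.n_heads, self.head_dim = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)
        self.out_proj = nn.Linear(d_model, d_model, bias=False)
        self.q_norm = nn.RMSNorm(self.head_dim)  # normalizing q and k bounds the
        self.k_norm = nn.RMSNorm(self.head_dim)  # attention logits for stability

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        shape = (b, t, self.n_heads, self.head_dim)
        q, k, v = (z.view(shape).transpose(1, 2) for z in (q, k, v))
        q, k = self.q_norm(q), self.k_norm(k)     # QK-Norm before the dot product
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.out_proj(y.transpose(1, 2).reshape(b, t, d))

def init_depth_scaled_norms(blocks: nn.ModuleList, num_layers: int) -> None:
    """Initialize the gain of each block's second RMSNorm to 1/sqrt(L)."""
    for block in blocks:
        nn.init.constant_(block.post_norm.weight, 1.0 / math.sqrt(num_layers))
```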
Throughput gains are being driven by Multi-Token Prediction (MTP) and optimized MoE structures. Step 3.5 Flash implements MTP-3, predicting three future tokens ($t+1$ through $t+3$) at each position during both training and inference to accelerate generation. In MoE design, there is a trend toward coarser expert distributions, as seen in the Trinity Large and Mistral 3 Large architectures, which improves inference throughput relative to the highly granular expert structures of DeepSeek V3. In addition, gated attention mechanisms, which apply an elementwise gate to the scaled dot-product output before the output linear projection, are being adopted to reduce attention sinks and improve long-sequence generalization.
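As an illustration of that gating mechanism, here is a minimal sketch in which the gate is a sigmoid of a learned linear projection of the layer input, applied elementwise to the attention output before the output projection; this particular formulation is an assumption drawn from recent gated-attention work, and the exact design in these models may differ.

```python
import torch
import torch.nn as nn

class GatedAttentionOutput(nn.Module):
    """Elementwise output gate applied to the attention result before out_proj.

    y = out_proj(sigmoid(W_g x) * attn_out)
    Gating lets each token and channel down-weight its attention contribution,
    one proposed remedy for attention sinks on long sequences.
    """
    def __init__(self, d_model: int):
        super().__init__()
        self.gate_proj = nn.Linear(d_model, d_model, bias=False)
        self.out_proj = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x: torch.Tensor, attn_out: torch.Tensor) -> torch.Tensor:
        gate = torch.sigmoid(self.gate_proj(x))   # per-token, per-channel gate
        return self.out_proj(gate * attn_out)

# x: layer input, attn_out: scaled dot-product attention output (same shape)
x, attn_out = torch.randn(2, 16, 64), torch.randn(2, 16, 64)
print(GatedAttentionOutput(64)(x, attn_out).shape)  # torch.Size([2, 16, 64])
```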
Impact / Why It Matters
The emergence of high-throughput MoE models with small active parameter counts lets developers deploy highly capable, specialized models (for coding or long-context retrieval, for example) on far more modest hardware than trillion-parameter-class performance previously required.