Mixture of Experts (MoEs) in Transformers
Summary
The transformers library has undergone a significant architectural refactor to transition from dense-model-centric loading to a specialized pipeline for Mixture of Experts (MoE) architectures. This update introduces a dynamic weight-loading system that enables the efficient packing of individual expert tensors into contiguous runtime tensors required by optimized kernels.
Key Points
- WeightConverter Abstraction: A new mechanism that maps source checkpoint key patterns to target runtime keys using composable operations such as MergeModulelist and SplitModulelist.
- Loading Performance: In benchmarks using Qwen/Qwen1.5-110B-Chat, the v5 async loading strategy achieved ~20.71s, a significant improvement over the ~66s observed in v4.57.6.
- Tensor Parallelism (TP) Optimization: The v5 implementation with async loading and TP reached a load time of 10.1s.
- Experts Backend System: Introduced via PR #42697, this system uses the @use_experts_implementation decorator to decouple expert computation from model implementation, allowing pluggable execution backends.
- Integrated Quantization: Quantization is now integrated directly into the weight-loading conversion pipeline, enabling efficient per-expert quantization during the packing process.
- New Configuration Controls: Developers can manage the new pipeline with HF_ENABLE_PARALLEL_LOADING for parallel shard loading and HF_DEACTIVATE_ASYNC_LOAD to revert to the synchronous pipeline.
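The key-pattern mapping behind the WeightConverter abstraction can be sketched in a few lines. This is a hypothetical illustration, not the library's actual API: the function name, the regex, and the key layout are assumptions chosen to mirror the idea of collapsing per-expert checkpoint keys under one packed runtime key.

```python
import re
from collections import defaultdict

def group_expert_keys(checkpoint_keys):
    """Group per-expert checkpoint keys (e.g. 'layers.0.experts.3.gate_proj.weight')
    under a single packed runtime key (e.g. 'layers.0.experts.gate_proj.weight')."""
    # Hypothetical pattern: strip the per-expert index out of the source key.
    pattern = re.compile(r"^(.*)\.experts\.(\d+)\.(\w+)\.weight$")
    groups = defaultdict(list)
    for key in checkpoint_keys:
        m = pattern.match(key)
        if m:
            prefix, idx, proj = m.groups()
            target = f"{prefix}.experts.{proj}.weight"
            groups[target].append((int(idx), key))
    # Sort each group by expert index so the later packing order is deterministic.
    return {t: [k for _, k in sorted(v)] for t, v in groups.items()}

keys = [f"layers.0.experts.{i}.gate_proj.weight" for i in range(4)]
grouped = group_expert_keys(keys)
# All four source keys now map to one target runtime key.
```

A MergeModulelist-style operation would then consume each group as a unit, while a SplitModulelist-style operation would invert the mapping.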
Technical Details
The core engineering challenge addressed by this refactor is the mismatch between MoE checkpoint serialization and runtime execution requirements. While checkpoints often store each expert as an independent tensor (e.g., hundreds of separate gate_proj.weight keys), modern GPU kernels like grouped GEMMs require these weights to be packed into a single, contiguous tensor for efficient processing. The WeightConverter resolves this by treating the checkpoint as a serialized source of tensors that undergoes a transformation pipeline during loading.
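The packing step itself amounts to stacking N independent per-expert matrices into one contiguous (num_experts, out_features, in_features) tensor that a grouped-GEMM kernel can index by expert id. A minimal NumPy sketch, with illustrative shapes:

```python
import numpy as np

# Hypothetical shapes: 8 experts, each with an (intermediate, hidden) gate projection.
num_experts, hidden, intermediate = 8, 16, 32
expert_weights = [np.random.rand(intermediate, hidden).astype(np.float32)
                  for _ in range(num_experts)]

# MergeModulelist-style packing: one contiguous tensor indexed by expert id,
# instead of hundreds of separately allocated tensors.
packed = np.stack(expert_weights, axis=0)
assert packed.flags["C_CONTIGUOUS"]
assert packed.shape == (num_experts, intermediate, hidden)

# The inverse (SplitModulelist-style) is just a per-expert view, no copy needed.
unpacked = [packed[i] for i in range(num_experts)]
assert all(np.array_equal(a, b) for a, b in zip(unpacked, expert_weights))
```

The contiguity is the point: a grouped GEMM can stride through `packed` directly, which is impossible when each expert's weight lives in its own allocation.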
The loading pipeline utilizes single-pass routing, async materialization, and conversion-aware scheduling. By registering weights as futures and using a thread pool for materialization, the loader avoids repeated scans and minimizes memory peaks. This architecture ensures that complex operations, such as MergeModulelist (which stacks experts) or SplitModulelist (which unpacks them), only execute once all necessary dependencies are loaded and ready.
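The scheduling idea, registering each weight as a future, materializing it off-thread, and firing the merge only once every dependency is ready, can be modeled with stdlib primitives. This is a toy model under assumed names (load_shard stands in for reading a tensor from a checkpoint shard); the loader's real internals differ:

```python
from concurrent.futures import ThreadPoolExecutor, wait

def load_shard(name):
    # Stand-in for materializing one expert tensor from a checkpoint shard.
    return f"tensor({name})"

expert_keys = [f"experts.{i}.gate_proj.weight" for i in range(4)]

with ThreadPoolExecutor(max_workers=4) as pool:
    # Single-pass routing: each weight is registered exactly once, as a future.
    futures = {k: pool.submit(load_shard, k) for k in expert_keys}
    # Conversion-aware scheduling: the MergeModulelist-style operation fires
    # only after every dependency has materialized, never on a partial set.
    wait(futures.values())
    packed = [futures[k].result() for k in expert_keys]

# packed now holds all four expert tensors in deterministic order,
# ready to be stacked into one contiguous runtime tensor.
```

Because the futures are keyed once up front, the loader never rescans the checkpoint, and memory peaks stay bounded by the pool's in-flight work rather than by the full expert count.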
Impact / Why It Matters
This refactor significantly reduces model loading latency and memory overhead when deploying large-scale sparse models like DeepSeek-V3 or Qwen. For developers and self-hosters, it enables the efficient use of high-capacity MoE models on hardware that would otherwise struggle with the memory peaks and slow initialization of unoptimized loading processes.