★ 7/10 · AI · 2026-03-09

Ulysses Sequence Parallelism: Training with Million-Token Contexts

Summary

Ulysses Sequence Parallelism, part of the Arctic Long Sequence Training (ALST) protocol, enables training with million-token contexts by distributing attention computation across multiple GPUs. It utilizes attention head parallelism to overcome the quadratic memory and compute limitations of standard transformer attention mechanisms.

Key Points

  • Implements sequence sharding combined with attention head redistribution via all-to-all collective operations.
  • Reduces communication complexity to $O(d/p)$ per GPU (where $d$ is hidden dimension and $p$ is parallelism degree), offering lower latency and volume than Ring Attention's $O(s \cdot d/p)$.
  • Integrates with Hugging Face Accelerate through the ParallelismConfig and DeepSpeedSequenceParallelConfig classes.
  • Utilizes position_ids instead of 4D attention_mask tensors to maintain causal masking without the prohibitive memory overhead of large mask matrices.
  • Requires SFTConfig's pad_to_multiple_of parameter to match sp_size so that sequences divide evenly across GPUs (see the configuration sketch after this list).
  • Automates dataloader wrapping via UlyssesSPDataLoaderAdapter and specialized loss computation when used within the Transformers Trainer.
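
The sketch below shows how these pieces could fit together for a TRL SFT run. The sp_size, sp_backend, and pad_to_multiple_of names come from the points above; the import path, the max_length value, and the parallelism_config argument on SFTConfig are assumptions that may differ across accelerate, trl, and transformers versions.

```python
# Hedged configuration sketch; verify field and argument names against the
# installed accelerate / trl / transformers versions before relying on it.
from accelerate.parallelism_config import ParallelismConfig  # assumption: import path
from trl import SFTConfig

sp_size = 4  # sequence-parallel degree: GPUs that jointly hold one sequence

pc = ParallelismConfig(
    sp_size=sp_size,
    sp_backend="deepspeed",  # per the notes above, only the DeepSpeed backend is supported
)

args = SFTConfig(
    output_dir="ulysses-sft",
    max_length=262_144,          # illustrative long-context target
    pad_to_multiple_of=sp_size,  # keeps every sequence evenly divisible across SP ranks
    parallelism_config=pc,       # assumption: recent TrainingArguments accept this field
)
```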

Technical Details

Ulysses Sequence Parallelism operates by partitioning the input sequence dimension across GPUs and then performing an all-to-all communication step. During this step, the data is redistributed so that each GPU holds all sequence positions for a specific subset of attention heads. This allows each GPU to perform local attention computation using optimized kernels like FlashAttention-2 or SDPA. The process concludes with a second all-to-all operation to return the data to its original sequence-sharded format before the output projection.
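
The toy PyTorch sketch below traces that round trip on a single rank: it performs the two all-to-all exchanges by hand and stands in for the optimized DeepSpeed implementation rather than reproducing it. It assumes an already initialized torch.distributed process group (sp_group) and heads and sequence length evenly divisible by the SP degree.

```python
# Toy Ulysses round trip on one rank (illustrative only). Shapes:
# b = batch, s = full sequence length, h = attention heads, d = head dim,
# p = sequence-parallel degree.
import torch
import torch.distributed as dist
import torch.nn.functional as F

def ulysses_attention(q, k, v, sp_group):
    """q, k, v arrive sequence-sharded: [b, s/p, h, d]."""
    p = dist.get_world_size(sp_group)

    def seq_to_heads(x):
        # Exchange so each rank ends up with the FULL sequence for h/p heads.
        chunks = [c.contiguous() for c in x.chunk(p, dim=2)]  # split head dim
        out = [torch.empty_like(c) for c in chunks]
        dist.all_to_all(out, chunks, group=sp_group)
        return torch.cat(out, dim=1)                          # [b, s, h/p, d]

    def heads_to_seq(x):
        # Inverse exchange: back to the sequence-sharded layout.
        chunks = [c.contiguous() for c in x.chunk(p, dim=1)]  # split sequence dim
        out = [torch.empty_like(c) for c in chunks]
        dist.all_to_all(out, chunks, group=sp_group)
        return torch.cat(out, dim=2)                          # [b, s/p, h, d]

    q, k, v = seq_to_heads(q), seq_to_heads(k), seq_to_heads(v)

    # Local causal attention over the full sequence for this rank's head
    # subset; SDPA expects [b, heads, seq, d], hence the transposes.
    o = F.scaled_dot_product_attention(
        q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2), is_causal=True
    ).transpose(1, 2)

    # Second all-to-all restores [b, s/p, h, d] before the output projection.
    return heads_to_seq(o)
```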

For developers implementing custom training loops with Accelerate, manual weighted loss aggregation is required to maintain gradient accuracy. Because tokens may be unevenly distributed across ranks (e.g., due to padding or masked tokens), the loss must be aggregated by gathering both the loss and the count of valid tokens (shift_labels != -100) across the sequence parallel group. In contrast, the Transformers Trainer and TRL SFTTrainer handle this complexity automatically, including management of the sp_backend (which must be set to "deepspeed") and adjustment of the effective data parallel world size (dp_world_size = world_size // sp_size).
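
A minimal sketch of that aggregation is shown below. The sp_weighted_loss helper is hypothetical rather than a library API; it assumes a per-rank mean loss and an initialized sequence-parallel group, and uses the differentiable all_reduce from torch.distributed.nn so gradients still flow through the reduction.

```python
# Hedged sketch of SP-aware loss aggregation for a custom Accelerate loop.
# `sp_weighted_loss` is a hypothetical helper, not a library API.
import torch
import torch.distributed as dist
from torch.distributed.nn.functional import all_reduce as diff_all_reduce

def sp_weighted_loss(loss: torch.Tensor, shift_labels: torch.Tensor, sp_group) -> torch.Tensor:
    """loss: this rank's mean loss over its local sequence shard."""
    # Tokens labeled -100 are ignored by cross-entropy, so ranks whose shard
    # is mostly padding or masked tokens contribute fewer valid tokens.
    n_valid = (shift_labels != -100).sum()

    # Weight the local mean by the local token count, then sum the weighted
    # losses across the SP group with a differentiable all-reduce so the
    # backward pass still sees every rank's contribution.
    weighted = diff_all_reduce(loss * n_valid, op=dist.ReduceOp.SUM, group=sp_group)

    # Token counts carry no gradient, so a plain all-reduce suffices here.
    dist.all_reduce(n_valid, group=sp_group)

    # Global mean over all valid tokens of the full, unsharded sequence.
    return weighted / n_valid.clamp(min=1)
```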

Impact / Why It Matters

Ulysses Sequence Parallelism lets developers train models on extremely long individual inputs, such as entire codebases or lengthy legal documents, by sidestepping the single-GPU memory bottleneck of standard attention. It provides a high-efficiency, low-latency alternative to Ring Attention for large-scale, long-context model training.

AI Machine Learning Distributed Training
