Unweight: how we compressed an LLM by 22% without sacrificing quality
Summary
Unweight is a lossless compression system designed to reduce LLM weight size and alleviate memory bandwidth bottlenecks during inference on NVIDIA H100 GPUs. By compressing the exponent bytes of BF16 weights, the system achieves a 15–22% reduction in overall model size while maintaining bit-exact output precision.
Key Points
- Achieves a 15–22% reduction in total model size and approximately 3 GB of VRAM savings (demonstrated on Llama-3.1-8B).
- Targets MLP weight matrices (gate, up, and down projections), which represent roughly two-thirds of model parameters and dominate memory traffic.
- Utilizes Huffman coding to compress the 8-bit exponent of BF16 weights, leveraging the fact that the top 16 exponents account for over 99% of all weights.
- Implements four execution pipelines—Full decode, Exponent-only decode, Palette transcode, and Direct palette skip—to optimize for varying batch sizes and matrix shapes.
- Decompresses weights within fast on-chip shared memory (SMEM) to avoid additional round-trips through High Bandwidth Memory (HBM).
- Avoids per-element branching in the execution hot path by processing weights in rows of 64; if a row contains an exponent outside the top-16 palette, the entire row is stored uncompressed.
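The exponent skew behind these numbers can be checked directly by pulling the 8-bit exponent field out of BF16 words. The sketch below is illustrative, not Unweight's code: it truncates float32 Gaussians to BF16 (a rough stand-in for trained weights) and measures how much of the tensor the 16 most common exponents cover.

```python
import numpy as np

def bf16_exponents(bf16_words: np.ndarray) -> np.ndarray:
    # BF16 layout: 1 sign bit | 8 exponent bits | 7 mantissa bits.
    # The exponent therefore sits in bits 7..14 of the 16-bit word.
    return ((bf16_words >> 7) & 0xFF).astype(np.uint8)

# Stand-in for a weight tensor: truncate float32 Gaussians to BF16 words.
rng = np.random.default_rng(0)
floats = rng.normal(scale=0.02, size=1 << 16).astype(np.float32)
bf16_words = (floats.view(np.uint32) >> 16).astype(np.uint16)

counts = np.bincount(bf16_exponents(bf16_words), minlength=256)
coverage = np.sort(counts)[-16:].sum() / counts.sum()
print(f"top-16 exponents cover {coverage:.2%} of weights")
```

Because halving a weight's magnitude only decrements the exponent by one, almost all of a layer's weights land in a narrow exponent band, which is exactly what makes the entropy coding pay off.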
Technical Details
The compression mechanism exploits the high redundancy in the 8-bit exponent of BF16 values. While the sign bit and 7-bit mantissa are effectively incompressible random data, the exponent distribution is highly skewed, allowing Huffman coding to shrink the exponent stream by approximately 30%. To maintain performance, the system avoids checking every weight for edge cases: weights are grouped into rows of 64, and any weight with a rare exponent triggers verbatim storage of the entire row.
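The row-level fallback can be sketched as follows. This is a hypothetical encoder, not Unweight's implementation: it emits raw palette indices where the real system would Huffman-code the exponents, but it shows how one flag per 64-weight row replaces per-element branching.

```python
import numpy as np

def encode_rows(exponents: np.ndarray, palette, row: int = 64):
    """Split exponent rows into palette-coded and verbatim rows (sketch)."""
    # Map each of the 256 possible exponent values to its palette slot (-1 = rare).
    lut = np.full(256, -1, dtype=np.int16)
    lut[np.asarray(palette)] = np.arange(len(palette))

    coded_rows, verbatim_rows = [], []
    for r in exponents.reshape(-1, row):
        idx = lut[r]
        if (idx >= 0).all():
            # Every exponent is common: keep compact palette indices
            # (the real system Huffman-codes the exponent stream instead).
            coded_rows.append(idx.astype(np.uint8))
        else:
            # One rare exponent poisons the whole row: store it verbatim,
            # so the decoder branches once per row, never per element.
            verbatim_rows.append(r.copy())
    return coded_rows, verbatim_rows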
The system's efficiency depends on selecting the appropriate execution pipeline for the workload. For small batch sizes (1–64 tokens), the "Full decode" pipeline reconstructs the original BF16 weights and uses standard NVIDIA cuBLAS routines. For larger batch sizes (256+ tokens), the system employs the "Exponent-only" or "Palette transcode" pipelines, whose custom kernels reconstruct BF16 values in shared memory (SMEM) and feed them directly to the tensor cores. This trades the GPU's spare compute capacity for decompression work, mitigating the HBM-bandwidth bottleneck on weight loads.
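The dispatch described above reduces to a rule on batch size. The 64- and 256-token thresholds come from the text; the behavior in between, the shape check, and the omission of the "Direct palette skip" path are assumptions made for this illustration.

```python
def choose_pipeline(batch_tokens: int, shape_supported: bool = True) -> str:
    """Pick a decode pipeline for one MLP matmul (illustrative sketch)."""
    if batch_tokens <= 64:
        # Small batch: decode once and reuse standard cuBLAS; the matmul
        # itself is memory-bound, so decode cost is easily amortized.
        return "full_decode"
    if batch_tokens >= 256:
        # Large batch: enough spare FLOPs to decode in SMEM right before
        # the tensor cores consume the weights.
        return "palette_transcode" if shape_supported else "exponent_only"
    # The write-up leaves 65-255 tokens unspecified; default conservatively.
    return "full_decode"
```

In practice a dispatcher like this would also key on matrix shape, since the custom kernels only win when the tile sizes line up with the tensor-core fragment layout.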
Impact / Why It Matters
Unweight enables higher model density on a single GPU, allowing for more efficient and cost-effective large-scale inference. It provides a method to increase throughput on Hopper-class GPUs without the accuracy degradation associated with lossy quantization techniques.