Introducing Storage Buckets on the Hugging Face Hub
Summary
Hugging Face has introduced Storage Buckets, a mutable, S3-like object storage layer on the Hub designed for high-throughput ML artifacts such as training checkpoints, optimizer states, and processed datasets. Built on the Xet backend, Buckets provide chunk-based deduplication that reduces storage costs and speeds up transfers for these iterative workloads.
Key Points
- Buckets provide non-versioned, S3-like object storage accessible via the `hf://buckets/username/bucket-name` URI scheme.
- The backend utilizes Xet's chunk-based storage to enable content deduplication across files, reducing bandwidth and storage costs.
- Python integration is available via `huggingface_hub` version 1.5.0 and above, supporting functions like `create_bucket`, `list_bucket_tree`, and `sync_bucket` (see the sketch after this list).
- JavaScript support is available via `@huggingface/hub` version 2.10.5 and above.
- `HfFileSystem` (fsspec-compatible) allows direct integration with `pandas`, `Polars`, and `Dask` using `hf://` paths.
- "Pre-warming" functionality allows users to move data to specific cloud regions (currently AWS and GCP) to minimize latency for distributed training clusters.
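A minimal sketch of the Python workflow, assuming `huggingface_hub` >= 1.5.0 exposes the three functions named above; the exact parameter names (`bucket_id`, `local_dir`) are assumptions, not confirmed signatures.

```python
from huggingface_hub import create_bucket, list_bucket_tree, sync_bucket

# Create a bucket under a user or organization namespace (assumed signature).
create_bucket("username/training-artifacts")

# Mirror a local checkpoint directory into the bucket; with Xet's
# chunk-based storage, bytes already present server-side are skipped
# (assumed signature).
sync_bucket(local_dir="./checkpoints", bucket_id="username/training-artifacts")

# Walk the bucket's contents (assumed signature and return type).
for entry in list_bucket_tree("username/training-artifacts"):
    print(entry)
```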
Technical Details
Buckets function as non-versioned containers under user or organization namespaces, utilizing the Xet backend to break files into chunks rather than treating them as monolithic blobs. This architecture allows the system to skip bytes that already exist in the storage layer, which is particularly efficient for uploading successive model checkpoints or iteratively processed datasets. For Enterprise customers, billing is calculated based on this deduplicated storage footprint.
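To make the deduplication claim concrete, the hypothetical loop below re-syncs the same checkpoint directory after every epoch; because Xet chunks files, each sync should transfer only the chunks that changed since the previous one. The `sync_bucket` call follows the assumed signature from the sketch above, and the training and checkpointing helpers are placeholders.

```python
from huggingface_hub import sync_bucket

BUCKET = "username/training-artifacts"  # hypothetical bucket ID
NUM_EPOCHS = 3

def train_one_epoch() -> None:
    ...  # placeholder for a real training step

def save_checkpoint(path: str) -> None:
    ...  # placeholder: serialize model and optimizer state to `path`

for _ in range(NUM_EPOCHS):
    train_one_epoch()
    save_checkpoint("./checkpoints/latest.pt")
    # Successive checkpoints share most of their bytes, so the chunk-level
    # dedup described above means each sync uploads only what changed
    # (assumed call signature).
    sync_bucket(local_dir="./checkpoints", bucket_id=BUCKET)
```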
Management is handled through the `hf` CLI, which supports commands for creating buckets (`hf buckets create`), syncing directories (`hf buckets sync`), copying files (`hf buckets cp`), and removing objects (`hf buckets remove`). The sync command includes `--dry-run` for execution planning and `--plan <file>.jsonl` for saving and later applying transfer plans. Through `HfFileSystem`, developers can implement standard filesystem operations, such as `ls`, `glob`, and `open`, directly on remote bucket contents, enabling seamless integration into existing data pipelines without requiring local data staging, as shown below.
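As an illustration of the `HfFileSystem` integration, the snippet below runs `ls` and `glob` against remote bucket contents and reads a Parquet file straight into `pandas` over an `hf://` path. The `hf://buckets/...` layout follows the URI scheme above, and the bucket and file names are hypothetical.

```python
import pandas as pd
from huggingface_hub import HfFileSystem

fs = HfFileSystem()

# Standard filesystem operations on remote bucket contents
# (hypothetical bucket and file paths).
print(fs.ls("hf://buckets/username/training-artifacts"))
shards = fs.glob("hf://buckets/username/training-artifacts/**/*.parquet")

# fsspec-compatible hf:// paths plug directly into pandas,
# so no local staging is required.
df = pd.read_parquet(
    "hf://buckets/username/training-artifacts/data/shard-0000.parquet"
)
print(df.head())
```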
Impact / Why It Matters
Storage Buckets provide a high-performance, mutable layer for managing intermediate ML artifacts that are too large or frequently updated for Git-based repositories. This allows developers to maintain a continuous workflow from raw data processing and training to final publication in versioned model or dataset repositories.