Introducing Storage Buckets on the Hugging Face Hub
Summary
Hugging Face has introduced Storage Buckets, a mutable, S3-like object storage layer on the Hub designed for high-throughput ML artifacts such as training checkpoints, optimizer states, and processed datasets. Built on the Xet backend, Buckets provide chunk-based deduplication that reduces storage costs and speeds up transfers for these iterative workloads.
Key Points
- Buckets provide non-versioned, S3-like object storage accessible via the `hf://buckets/username/bucket-name` URI scheme.
- The backend utilizes Xet's chunk-based storage to enable content deduplication across files, reducing bandwidth and storage costs.
- Python integration is available via `huggingface_hub` version 1.5.0 and above, supporting functions like `create_bucket`, `list_bucket_tree`, and `sync_bucket` (see the sketch after this list).
- JavaScript support is available via `@huggingface/hub` version 2.10.5 and above.
- `HfFileSystem` (fsspec-compatible) allows direct integration with `pandas`, `Polars`, and `Dask` using `hf://` paths.
- "Pre-warming" functionality allows users to move data to specific cloud regions (currently AWS and GCP) to minimize latency for distributed training clusters.
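A minimal sketch of the Python workflow, assuming `huggingface_hub` >= 1.5.0 exposes the three functions named above; the exact parameter names (`bucket_id`, `local_dir`) are assumptions, not confirmed signatures.

```python
from huggingface_hub import create_bucket, list_bucket_tree, sync_bucket

# Create a bucket under a user or organization namespace (assumed signature).
create_bucket("username/training-artifacts")

# Mirror a local checkpoint directory into the bucket; with Xet's
# chunk-based storage, bytes already present server-side are skipped
# (assumed signature).
sync_bucket(local_dir="./checkpoints", bucket_id="username/training-artifacts")

# Walk the bucket's contents (assumed signature and return type).
for entry in list_bucket_tree("username/training-artifacts"):
    print(entry)
```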
Technical Details
Buckets function as non-versioned containers under user or organization namespaces, utilizing the Xet backend to break files into chunks rather than treating them as monolithic blobs. This architecture allows the system to skip bytes that already exist in the storage layer, which is particularly efficient for uploading successive model checkpoints or iteratively processed datasets. For Enterprise customers, billing is calculated based on this deduplicated storage footprint.
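To make the deduplication claim concrete, the hypothetical loop below re-syncs the same checkpoint directory after every epoch; because Xet chunks files, each sync should transfer only the chunks that changed since the previous one. The `sync_bucket` call follows the assumed signature from the sketch above, and the training and checkpointing helpers are placeholders.

```python
from huggingface_hub import sync_bucket

BUCKET = "username/training-artifacts"  # hypothetical bucket ID
NUM_EPOCHS = 3

def train_one_epoch() -> None:
    ...  # placeholder for a real training step

def save_checkpoint(path: str) -> None:
    ...  # placeholder: serialize model and optimizer state to `path`

for _ in range(NUM_EPOCHS):
    train_one_epoch()
    save_checkpoint("./checkpoints/latest.pt")
    # Successive checkpoints share most of their bytes, so the chunk-level
    # dedup described above means each sync uploads only what changed
    # (assumed call signature).
    sync_bucket(local_dir="./checkpoints", bucket_id=BUCKET)
```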
Management is handled through the `hf` CLI, which supports commands for creating buckets (`hf buckets create`), syncing directories (`hf buckets sync`), copying files (`hf buckets cp`), and removing objects (`hf buckets remove`). The sync command includes `--dry-run` for execution planning and `--plan <file>.jsonl` for saving and later applying transfer plans. Through `HfFileSystem`, developers can implement standard filesystem operations, such as `ls`, `glob`, and `open`, directly on remote bucket contents, enabling seamless integration into existing data pipelines without requiring local data staging, as shown below.
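As an illustration of the `HfFileSystem` integration, the snippet below runs `ls` and `glob` against remote bucket contents and reads a Parquet file straight into `pandas` over an `hf://` path. The `hf://buckets/...` layout follows the URI scheme above, and the bucket and file names are hypothetical.

```python
import pandas as pd
from huggingface_hub import HfFileSystem

fs = HfFileSystem()

# Standard filesystem operations on remote bucket contents
# (hypothetical bucket and file paths).
print(fs.ls("hf://buckets/username/training-artifacts"))
shards = fs.glob("hf://buckets/username/training-artifacts/**/*.parquet")

# fsspec-compatible hf:// paths plug directly into pandas,
# so no local staging is required.
df = pd.read_parquet(
    "hf://buckets/username/training-artifacts/data/shard-0000.parquet"
)
print(df.head())
```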
Impact / Why It Matters
Storage Buckets provide a high-performance, mutable layer for managing intermediate ML artifacts that are too large or frequently updated for Git-based repositories. This allows developers to maintain a continuous workflow from raw data processing and training to final publication in versioned model or dataset repositories.