★ 6/10 · Dev-tools · 2026-05-02

DuckLake 1.0: Data Lake Format with SQL Catalog Metadata

Summary

DuckDB Labs has released DuckLake 1.0, a data lake format that stores table metadata in a SQL database rather than as individual files in object storage. This architecture is designed to mitigate the performance bottlenecks and coordination complexities associated with file-based metadata management in traditional lakehouse formats.

Key Points

  • DuckLake 1.0 is available as a DuckDB extension under the MIT license.
  • Unlike Apache Iceberg, Delta Lake, or Apache Hudi, DuckLake stores metadata directly in a SQL database to avoid the "small file problem" and slow metadata operations.
  • The release includes "data inlining," which allows small inserts, updates, and deletes to be processed within the catalog database; the default threshold for inlining is 10 rows.
  • Technical features include sorted tables for faster filtered queries, bucket partitioning for high-cardinality columns, and deletion vectors compatible with the Iceberg specification.
  • Client support is available for Apache DataFusion, Apache Spark, Trino, and Pandas.
  • The development roadmap includes v1.1 (variant inlining and multi-deletion vector Puffin files) and v2.0 (Git-like branching and built-in role-based permissions).
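
Since DuckLake ships as a DuckDB extension, getting started is a few SQL statements. The sketch below follows the documented INSTALL/LOAD/ATTACH pattern; the catalog and data paths are placeholder values, and exact option names should be checked against the current release:

```sql
-- Install and load the DuckLake extension inside DuckDB
INSTALL ducklake;
LOAD ducklake;

-- Attach a DuckLake catalog. Here the metadata lives in a local DuckDB
-- file, while table data is written as Parquet under DATA_PATH.
-- (Both paths are illustrative placeholders.)
ATTACH 'ducklake:metadata.ducklake' AS my_lake (DATA_PATH 'data_files/');
USE my_lake;

-- Tables created in the attached catalog are managed by DuckLake
CREATE TABLE events (id INTEGER, payload VARCHAR);
INSERT INTO events VALUES (1, 'hello');
```

In production the catalog would typically point at a shared SQL database (e.g. PostgreSQL) rather than a local file, so multiple writers can coordinate through ordinary database transactions.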

Technical Details

DuckLake 1.0 addresses the overhead of file-based metadata management by using a SQL catalog to track table state. A core feature is data inlining, which records small-scale write operations (inserts, updates, and deletes) directly in the catalog database. This avoids creating numerous small files in object storage, a pattern that typically leads to complex coordination and degraded performance in standard lakehouse architectures.
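
As an illustration, inlining is configured per catalog. The sketch below assumes a `DATA_INLINING_ROW_LIMIT` attach option and a `ducklake_flush_inlined_data` compaction call; treat both names as assumptions to verify against the release documentation:

```sql
-- Attach with a custom inlining threshold: writes of up to 100 rows
-- are stored in the catalog database instead of as new Parquet files.
-- (Option name is an assumption; the default threshold is 10 rows.)
ATTACH 'ducklake:metadata.ducklake' AS my_lake (
    DATA_PATH 'data_files/',
    DATA_INLINING_ROW_LIMIT 100
);

-- Small writes below the threshold stay inlined in the catalog...
INSERT INTO my_lake.events VALUES (2, 'small write');

-- ...and can later be compacted out to Parquet files in one pass.
-- (Function name is an assumption based on the DuckLake docs.)
CALL ducklake_flush_inlined_data('my_lake');
```

The practical effect is that a stream of trickle inserts lands as catalog rows, and only the periodic flush touches object storage.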

The format also introduces optimized storage structures, such as sorted tables to accelerate filtered queries and bucket partitioning to manage high-cardinality columns. For interoperability, DuckLake 1.0 supports geometry data types and maintains compatibility with Iceberg-style deletion vectors. While the primary implementation is a DuckDB extension, MotherDuck provides a hosted service that manages both the catalog database and the underlying storage.
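
The layout features above are expressed as DDL on the table. The statement below is a sketch of DuckLake's `SET PARTITIONED BY` syntax; the exact form for declaring bucket partitioning on a high-cardinality column is an assumption about the 1.0 surface and should be confirmed against the documentation:

```sql
-- Partition the table so filtered queries can prune irrelevant files.
-- Partitioning on a derived expression (here, the year of a timestamp)
-- groups data files by that value.
ALTER TABLE my_lake.events SET PARTITIONED BY (year(event_ts));
```

Sorted tables complement this: within each data file, rows stored in sort order let min/max statistics skip file ranges during filtered scans.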

Impact / Why It Matters

Developers can reduce storage fragmentation and metadata latency when performing frequent, small-batch updates to large datasets. Clients for Spark, Trino, and DataFusion make it possible to adopt this metadata approach within existing distributed data processing pipelines.

data-engineering duckdb dev-tools