Introducing Modular Diffusers - Composable Building Blocks for Diffusion Pipelines
Summary
Modular Diffusers introduces a composable architecture for diffusion pipelines, replacing the standard `DiffusionPipeline` with a system of interchangeable, self-contained blocks. This allows developers to construct highly customized workflows by adding, removing, or swapping individual functional units such as text encoders, denoisers, and decoders.
Key Points
- Introduces the `ModularPipeline` class for executing workflows composed of independent, programmable blocks (a loading and usage sketch follows this list).
- Supports custom block development via Python classes inheriting from `ModularPipelineBlocks`, which define specific `inputs`, `intermediate_outputs`, and `expected_components`.
- Implements `ComponentsManager` for automated memory management, offloading models to CPU when components are not actively in use.
- Introduces "Modular Repositories" that use `modular_model_index.json` to reference components from disparate repositories (e.g., a quantized transformer from one repo and a VAE from another).
- Enables automatic data flow between blocks in a sequence: the output of one block (e.g., a `control_image`) is automatically passed to any downstream block that requires that input.
- Integrates with Mellon, a node-based visual workflow interface for wiring blocks together.
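To make these points concrete, here is a minimal sketch of loading and running a modular pipeline with a `ComponentsManager` handling offloading. The repository id and prompt are placeholders, and the exact call signatures (`enable_auto_cpu_offload`, the `components_manager` argument, the `output="images"` selector) are assumptions based on the API surface described in this post, not a verified recipe.

```python
# Minimal sketch, assuming the Modular Diffusers API surface described above.
import torch
from diffusers import ComponentsManager, ModularPipeline

manager = ComponentsManager()
# Assumed helper: offload models to CPU when they are not actively in use.
manager.enable_auto_cpu_offload(device="cuda")

# Load a workflow from a modular repository (repo id is a placeholder).
pipe = ModularPipeline.from_pretrained(
    "your-org/your-modular-repo", components_manager=manager
)
# Fetch the models each block declares via its ComponentSpec.
pipe.load_components(torch_dtype=torch.float16)

image = pipe(prompt="an astronaut riding a horse", output="images")[0]
image.save("astronaut.png")
```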
Technical Details
The architecture is built around self-contained blocks that define their own computation logic in a `__call__` method. Each block uses `ComponentSpec` to declare the models it needs, which are fetched automatically via `load_components` or `update_components`. This enables complex pipeline manipulation, such as extracting a specific sub-workflow (e.g., `controlnet_text2image`) and inserting a custom `DepthProcessorBlock` at a specific index, as sketched below.
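The block contract can be illustrated with a short sketch. The overall structure, declaring `expected_components`, `inputs`, and `intermediate_outputs` and implementing `__call__`, mirrors what the post describes, but the import path, the `depth_estimator` component, and the state-handling helpers (`get_block_state` / `set_block_state`) are assumptions rather than confirmed API.

```python
# Hypothetical custom block, sketched against the interface described above.
from diffusers.modular_pipelines import (  # import path is an assumption
    ComponentSpec,
    InputParam,
    ModularPipelineBlocks,
    OutputParam,
)


class DepthProcessorBlock(ModularPipelineBlocks):
    @property
    def expected_components(self):
        # Models this block needs; fetched via load_components / update_components.
        return [ComponentSpec(name="depth_estimator")]  # hypothetical component

    @property
    def inputs(self):
        return [InputParam(name="image", required=True)]

    @property
    def intermediate_outputs(self):
        # A downstream block that declares a `control_image` input receives
        # this value automatically.
        return [OutputParam(name="control_image")]

    def __call__(self, components, state):
        block_state = self.get_block_state(state)
        block_state.control_image = components.depth_estimator(block_state.image)
        self.set_block_state(state, block_state)
        return components, state
```

A block like this could then be spliced into an extracted `controlnet_text2image` sub-workflow at whatever index precedes the denoising step.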
The `modular_model_index.json` configuration allows for advanced repository structures in which a single pipeline references a quantized transformer from a specialized repository while pulling the standard VAE from the original model repository. For large-scale models like Krea Realtime Video (14B parameters), the system supports high-performance execution, achieving 11 fps on a single NVIDIA B200 GPU. The configuration also supports `type_hint` entries to ensure the correct model classes are loaded and remain compatible across the pipeline; an illustrative sketch follows.
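As an illustration of that cross-repository layout, entries in `modular_model_index.json` might look like the following. The quantized-transformer repository id is a placeholder, and the exact schema is an assumption inferred from the fields mentioned in this post (per-component repository references and `type_hint`):

```json
{
  "transformer": {
    "repo": "your-org/transformer-4bit",
    "subfolder": "transformer",
    "type_hint": ["diffusers", "FluxTransformer2DModel"]
  },
  "vae": {
    "repo": "black-forest-labs/FLUX.1-dev",
    "subfolder": "vae",
    "type_hint": ["diffusers", "AutoencoderKL"]
  }
}
```

Here the quantized transformer and the VAE resolve to different repositories, while `type_hint` tells the loader which class to instantiate for each component.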
Impact / Why It Matters
This modularity allows developers to rapidly prototype and share complex, multi-stage generative architectures as easily as single models. It also simplifies the deployment of optimized, quantized versions of existing models by allowing users to swap only the necessary components without rebuilding the entire pipeline.