Training and Finetuning Multimodal Embedding & Reranker Models with Sentence Transformers
Summary
This entry details the process of finetuning multimodal embedding and reranker models using the Sentence Transformers library, specifically for tasks like Visual Document Retrieval (VDR). It demonstrates how domain-specific finetuning can significantly improve retrieval accuracy for complex visual inputs such as charts, tables, and document layouts.
Key Points
- Finetuning `Qwen/Qwen3-VL-Embedding-2B` for Visual Document Retrieval (VDR) improved NDCG@10 from 0.888 to 0.947.
- The `SentenceTransformer` class can load multimodal models and automatically detect the supported modalities (e.g., text, image, video) by inspecting the model's processor.
- The `Router` module allows separate, lightweight encoders (e.g., `all-MiniLM-L6-v2` for text and `siglip2-base-patch16-224` for images) to be composed into a single multimodal architecture.
- Training multimodal models requires aligning the embedding spaces, often necessitating a `Dense` projection layer when using a `Router` with separate encoders.
- The training pipeline supports diverse input types, including PIL images, file paths, URLs, and audio/video arrays, with the data collator handling preprocessing via `model.preprocess()` automatically.
- `CachedMultipleNegativesRankingLoss` is a primary loss function for retrieval tasks built on (anchor, positive, hard negative) triplets; see the training sketch after this list.
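
To make the training flow concrete, here is a minimal sketch of a finetuning run with `CachedMultipleNegativesRankingLoss` on (anchor, positive, negative) triplets. The dataset contents and image paths are illustrative placeholders, not the data behind the reported NDCG@10 numbers:

```python
from datasets import Dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer
from sentence_transformers.losses import CachedMultipleNegativesRankingLoss

# Load a multimodal embedding model; modalities are detected from its processor.
model = SentenceTransformer("Qwen/Qwen3-VL-Embedding-2B")

# (anchor, positive, negative) triplets: text queries paired with page images.
# Image columns may hold PIL images, file paths, or URLs; the data collator
# preprocesses them via model.preprocess() automatically.
train_dataset = Dataset.from_dict({
    "anchor": ["What was Q3 revenue growth?"],
    "positive": ["charts/q3_revenue.png"],
    "negative": ["charts/headcount.png"],
})

# In-batch negatives loss; gradient caching keeps memory usage flat
# even with large effective batch sizes.
loss = CachedMultipleNegativesRankingLoss(model, mini_batch_size=8)

trainer = SentenceTransformerTrainer(model=model, train_dataset=train_dataset, loss=loss)
trainer.train()
```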
Technical Details
When loading existing multimodal models, the `SentenceTransformer` constructor accepts `processor_kwargs` to control `AutoProcessor` parameters (such as `max_pixels` for image resolution) and `model_kwargs` to configure `AutoModel` parameters (such as `torch_dtype` or `attn_implementation`). If starting from a fresh Vision-Language Model (VLM) checkpoint that lacks an embedding structure, the `Transformer` module attempts to infer the modalities from the processor and automatically adds `Pooling` layers as needed.
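
As a rough illustration of those constructor hooks (the argument values here are placeholders, not recommended settings):

```python
import torch
from sentence_transformers import SentenceTransformer

model = SentenceTransformer(
    "Qwen/Qwen3-VL-Embedding-2B",
    # Forwarded to AutoModel: dtype and attention backend.
    model_kwargs={"torch_dtype": torch.bfloat16, "attn_implementation": "flash_attention_2"},
    # Forwarded to AutoProcessor: cap image resolution to bound sequence length.
    processor_kwargs={"max_pixels": 1024 * 28 * 28},
)
```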
For modular architectures, the `Router` module can route specific inputs to specialized sub-modules based on the detected modality. While this allows the use of specialized encoders, the resulting embedding spaces are initially unaligned and require training to achieve cross-modal similarity. The dataset must be structured so that all columns other than the "label" or "score" column are treated as inputs. The data collator manages the complexity of multimodal inputs by automatically applying the appropriate preprocessing logic for text, images, audio, or video based on the model's configuration.
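
A sketch of such a composition, assuming the `Router` maps route names to module lists and that the `Transformer` module can wrap the SigLIP checkpoint as described above; the "text"/"image" route names and the 768-to-384 `Dense` projection are illustrative, not the blog's exact setup:

```python
from sentence_transformers import SentenceTransformer, models
from sentence_transformers.models import Router

# Text route: a lightweight sentence encoder (384-dim output).
text_encoder = models.Transformer("sentence-transformers/all-MiniLM-L6-v2")
text_pooling = models.Pooling(text_encoder.get_word_embedding_dimension())

# Image route: a SigLIP vision encoder (768-dim output), projected down to
# 384 dims with a Dense layer so both routes emit embeddings of one size.
image_encoder = models.Transformer("google/siglip2-base-patch16-224")
image_projection = models.Dense(in_features=768, out_features=384)

router = Router({
    "text": [text_encoder, text_pooling],
    "image": [image_encoder, image_projection],
})

model = SentenceTransformer(modules=[router])
# The two embedding spaces start unaligned; contrastive finetuning (e.g. the
# triplet setup sketched earlier) is what pulls matching text and images together.
```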
Impact / Why It Matters
Developers can achieve state-of-the-art performance on specialized retrieval tasks by finetuning general-purpose VLMs on domain-specific datasets. This allows for highly accurate retrieval of complex visual documents, significantly outperforming much larger models that have not undergone task-specific finetuning.