Training and Finetuning Multimodal Embedding & Reranker Models with Sentence Transformers
Summary
This entry details the process of finetuning multimodal embedding and reranker models using the Sentence Transformers library, specifically for tasks like Visual Document Retrieval (VDR). It demonstrates how domain-specific finetuning can significantly improve retrieval accuracy for complex visual inputs such as charts, tables, and document layouts.
Key Points
- Finetuning `Qwen/Qwen3-VL-Embedding-2B` for Visual Document Retrieval (VDR) improved NDCG@10 from 0.888 to 0.947.
- The `SentenceTransformer` class can load multimodal models and automatically detect the supported modalities (e.g., text, image, video) by inspecting the model's processor.
- The `Router` module allows separate, lightweight encoders (e.g., `all-MiniLM-L6-v2` for text and `siglip2-base-patch16-224` for images) to be composed into a single multimodal architecture.
- Training multimodal models requires aligning the embedding spaces, often necessitating a `Dense` projection layer when using a `Router` with separate encoders.
- The training pipeline supports diverse input types, including PIL images, file paths, URLs, and audio/video arrays, with the data collator handling preprocessing via `model.preprocess()` automatically.
- `CachedMultipleNegativesRankingLoss` is a primary loss function for retrieval tasks built on (anchor, positive, hard negative) triplets; see the training sketch after this list.
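
To make the training flow concrete, here is a minimal sketch of a finetuning run with `CachedMultipleNegativesRankingLoss` on (anchor, positive, negative) triplets. The dataset contents and image paths are illustrative placeholders, not the data behind the reported NDCG@10 numbers:

```python
from datasets import Dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer
from sentence_transformers.losses import CachedMultipleNegativesRankingLoss

# Load a multimodal embedding model; modalities are detected from its processor.
model = SentenceTransformer("Qwen/Qwen3-VL-Embedding-2B")

# (anchor, positive, negative) triplets: text queries paired with page images.
# Image columns may hold PIL images, file paths, or URLs; the data collator
# preprocesses them via model.preprocess() automatically.
train_dataset = Dataset.from_dict({
    "anchor": ["What was Q3 revenue growth?"],
    "positive": ["charts/q3_revenue.png"],
    "negative": ["charts/headcount.png"],
})

# In-batch negatives loss; gradient caching keeps memory usage flat
# even with large effective batch sizes.
loss = CachedMultipleNegativesRankingLoss(model, mini_batch_size=8)

trainer = SentenceTransformerTrainer(model=model, train_dataset=train_dataset, loss=loss)
trainer.train()
```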
Technical Details
When loading existing multimodal models, the `SentenceTransformer` constructor accepts `processor_kwargs` to control `AutoProcessor` parameters (such as `max_pixels` for image resolution) and `model_kwargs` to configure `AutoModel` parameters (such as `torch_dtype` or `attn_implementation`). If starting from a fresh Vision-Language Model (VLM) checkpoint that lacks an embedding structure, the `Transformer` module attempts to infer the modalities from the processor and automatically adds `Pooling` layers as needed.
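
As a rough illustration of those constructor hooks (the argument values here are placeholders, not recommended settings):

```python
import torch
from sentence_transformers import SentenceTransformer

model = SentenceTransformer(
    "Qwen/Qwen3-VL-Embedding-2B",
    # Forwarded to AutoModel: dtype and attention backend.
    model_kwargs={"torch_dtype": torch.bfloat16, "attn_implementation": "flash_attention_2"},
    # Forwarded to AutoProcessor: cap image resolution to bound sequence length.
    processor_kwargs={"max_pixels": 1024 * 28 * 28},
)
```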
For modular architectures, the `Router` module can route specific inputs to specialized sub-modules based on the detected modality. While this allows the use of specialized encoders, the resulting embedding spaces are initially unaligned and require training to achieve cross-modal similarity. The dataset must be structured so that all columns other than the "label" or "score" column are treated as inputs. The data collator manages the complexity of multimodal inputs by automatically applying the appropriate preprocessing logic for text, images, audio, or video based on the model's configuration.
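
A sketch of such a composition, assuming the `Router` maps route names to module lists and that the `Transformer` module can wrap the SigLIP checkpoint as described above; the "text"/"image" route names and the 768-to-384 `Dense` projection are illustrative, not the blog's exact setup:

```python
from sentence_transformers import SentenceTransformer, models
from sentence_transformers.models import Router

# Text route: a lightweight sentence encoder (384-dim output).
text_encoder = models.Transformer("sentence-transformers/all-MiniLM-L6-v2")
text_pooling = models.Pooling(text_encoder.get_word_embedding_dimension())

# Image route: a SigLIP vision encoder (768-dim output), projected down to
# 384 dims with a Dense layer so both routes emit embeddings of one size.
image_encoder = models.Transformer("google/siglip2-base-patch16-224")
image_projection = models.Dense(in_features=768, out_features=384)

router = Router({
    "text": [text_encoder, text_pooling],
    "image": [image_encoder, image_projection],
})

model = SentenceTransformer(modules=[router])
# The two embedding spaces start unaligned; contrastive finetuning (e.g. the
# triplet setup sketched earlier) is what pulls matching text and images together.
```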
Impact / Why It Matters
Developers can achieve state-of-the-art performance on specialized retrieval tasks by finetuning general-purpose VLMs on domain-specific datasets. This allows for highly accurate retrieval of complex visual documents, significantly outperforming much larger models that have not undergone task-specific finetuning.