Embed-RL-4B: Reasoning-Driven Multimodal Embeddings

Embed-RL-4B is a 4-billion-parameter multimodal embedding model developed by Tsinghua University and Kuaishou Technology. It addresses limitations of existing generative embedding methods with a reasoning-driven Universal Multimodal Embedding (UME) framework. The framework uses Embedder-Guided Reinforcement Learning (EG-RL) to optimize a Reasoner so that it produces evidential Traceability Chain-of-Thought (T-CoT).

Key Capabilities & Innovations

Embedder-Guided Reinforcement Learning (EG-RL): A novel framework where the Embedder provides explicit supervision to the Reasoner, ensuring generated Chain-of-Thought (CoT) traces are aligned with embedding tasks.
Traceability CoT (T-CoT): Extracts critical multimodal cues to focus on retrieval-relevant elements, providing multimodal inputs for the Embedder.
Enhanced Cross-Modal Semantic Consistency: Integrates multimodal evidence in structured reasoning, paired with retrieval-oriented alignment.
Strong Performance with Limited Resources: Outperforms pioneering embedding models on the MMEB-V2 and UVRB benchmarks while training with limited computational resources.
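
To make the idea of embedder-provided supervision concrete, the following is a minimal sketch of how an embedding-based reward for a reasoning trace could be scored. It is an illustration only: the source does not specify the reward used in EG-RL, so the margin-style formula, the function names (`cosine`, `embedder_guided_reward`), and the toy 2-dimensional vectors are all assumptions rather than the authors' implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def embedder_guided_reward(query_emb, positive_emb, negative_embs):
    """Hypothetical reward: similarity to the correct target minus
    similarity to the hardest negative. Higher means the trace that
    produced query_emb helped retrieval more. This formula is an
    assumption, not the reward defined by EG-RL."""
    pos = cosine(query_emb, positive_emb)
    hardest_neg = max((cosine(query_emb, n) for n in negative_embs), default=0.0)
    return pos - hardest_neg


# Toy embeddings (assumed): the query embedded with two candidate traces.
positive = [1.0, 0.0]
negatives = [[0.0, 1.0]]
emb_with_good_trace = [1.0, 0.0]   # trace focused on retrieval-relevant cues
emb_with_poor_trace = [0.0, 1.0]   # trace that drifted toward a distractor

reward_good = embedder_guided_reward(emb_with_good_trace, positive, negatives)
reward_poor = embedder_guided_reward(emb_with_poor_trace, positive, negatives)
```

In a reinforcement learning loop, a scalar of this kind would be fed back to the Reasoner so that traces leading to better-separated embeddings are reinforced.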

Good For

Developing universal multimodal embeddings for diverse cross-modal tasks.
Applications requiring fine-grained matching capabilities across modalities.
Scenarios where targeted reasoning optimization can significantly improve multimodal embedding quality and generalization.
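
As a rough illustration of the cross-modal matching use case, the sketch below ranks candidates from different modalities against a query by cosine similarity of their embeddings. The vectors are hand-written placeholders, and `rank_candidates` is a hypothetical helper; in practice the embeddings would come from the model, whose loading and inference API is not described here.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def rank_candidates(query_emb, candidate_embs):
    """Return (candidate_id, score) pairs sorted by descending cosine similarity."""
    q = l2_normalize(query_emb)
    scored = []
    for cand_id, emb in candidate_embs.items():
        c = l2_normalize(emb)
        scored.append((cand_id, sum(a * b for a, b in zip(q, c))))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Placeholder embeddings (assumed values, not model outputs).
query = [0.9, 0.1, 0.0]
candidates = {
    "video_a": [1.0, 0.0, 0.0],
    "image_b": [0.0, 1.0, 0.0],
    "doc_c": [0.5, 0.5, 0.0],
}

ranking = rank_candidates(query, candidates)
ranked_ids = [cand_id for cand_id, _ in ranking]
```

Because all modalities share one embedding space, the same ranking routine applies whether the candidates are images, videos, or documents.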
