ZoengHouNaam/Embed-RL-4B
ZoengHouNaam/Embed-RL-4B is a 4-billion-parameter multimodal embedding model developed by Tsinghua University and Kuaishou Technology. It uses an Embedder-Guided Reinforcement Learning (EG-RL) framework to optimize a Reasoner that generates evidential Traceability Chain-of-Thought (T-CoT). By incorporating reasoning relevant to target retrieval into universal multimodal embeddings, it outperforms pioneering embedding models on the MMEB-V2 and UVRB benchmarks while using limited computational resources.
Embed-RL-4B: Reasoning-Driven Multimodal Embeddings
Embed-RL-4B is a 4-billion-parameter multimodal embedding model developed by Tsinghua University and Kuaishou Technology. It addresses limitations in existing generative embedding methods by proposing a reasoning-driven Universal Multimodal Embedding (UME) framework. This framework uses Embedder-Guided Reinforcement Learning (EG-RL) to optimize a Reasoner, enabling it to produce evidential Traceability CoT (T-CoT).
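To make the two-stage flow concrete, here is a minimal text-only sketch, assuming the Reasoner's trace is appended to the original input before the Embedder encodes it. The functions `toy_reasoner`, `toy_embedder`, `embed_query`, and `rank` are hypothetical stand-ins (a stopword filter and a bag-of-words vector), not the model's actual components or API; the real model is multimodal and uses learned networks for both stages.

```python
import math
import re
from typing import Dict, List

# Hypothetical stopword list used by the toy Reasoner below.
STOPWORDS = {"a", "an", "the", "of", "on", "in", "with", "is"}


def tokenize(text: str) -> List[str]:
    """Lowercase and split into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def toy_reasoner(query: str) -> str:
    """Hypothetical stand-in for the Reasoner.

    The real Reasoner is a generative model that writes a T-CoT; this stub
    only lists the query's content words as 'retrieval-relevant cues'.
    """
    cues = [t for t in tokenize(query) if t not in STOPWORDS]
    return "Key cues: " + ", ".join(cues)


def toy_embedder(text: str, vocab: Dict[str, int]) -> List[float]:
    """Hypothetical stand-in for the Embedder: L2-normalized bag-of-words."""
    vec = [0.0] * len(vocab)
    for tok in tokenize(text):
        if tok in vocab:
            vec[vocab[tok]] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def embed_query(query: str, vocab: Dict[str, int]) -> List[float]:
    # Step 1: the Reasoner produces a trace of cues.
    trace = toy_reasoner(query)
    # Step 2: the Embedder encodes input + trace; repeating the cues
    # up-weights them in this bag-of-words vector.
    return toy_embedder(query + " " + trace, vocab)


def rank(query: str, docs: List[str]) -> List[int]:
    """Return document indices sorted by cosine similarity to the query."""
    vocab = {t: i for i, t in enumerate(sorted({t for d in docs for t in tokenize(d)}))}
    q = embed_query(query, vocab)
    # Both vectors are unit length, so the dot product is the cosine.
    scores = [sum(x * y for x, y in zip(q, toy_embedder(d, vocab))) for d in docs]
    return sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)


docs = ["a chocolate cake recipe", "a red car parked on the street"]
print(rank("red car on the street", docs))  # [1, 0]
```

Feeding the trace alongside the original input, rather than replacing it, follows the card's description of T-CoT as supplying additional inputs to the Embedder; in this toy it simply emphasizes the cue words.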
Key Capabilities & Innovations
- Embedder-Guided Reinforcement Learning (EG-RL): A novel framework where the Embedder provides explicit supervision to the Reasoner, ensuring generated Chain-of-Thought (CoT) traces are aligned with embedding tasks.
- Traceability CoT (T-CoT): Extracts critical multimodal cues to focus on retrieval-relevant elements, providing multimodal inputs for the Embedder.
- Enhanced Cross-Modal Semantic Consistency: Integrates multimodal evidence into structured reasoning and pairs it with retrieval-oriented alignment.
- Strong Performance with Limited Resources: Outperforms pioneering embedding models on both MMEB-V2 and UVRB benchmarks despite using limited computational resources.
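The card does not spell out the reward used in EG-RL. As an illustration only, the sketch below assumes the Embedder scores each sampled trace by how well the resulting query embedding retrieves the correct target: the reward is the softmax probability of the positive target over cosine similarities to all candidates, with a hypothetical temperature `tau`. `cosine` and `embedder_reward` are illustrative names, not part of any released code.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def embedder_reward(
    query_emb: Sequence[float],
    candidate_embs: List[Sequence[float]],
    positive_idx: int,
    tau: float = 0.05,  # hypothetical temperature
) -> float:
    """Assumed EG-RL-style reward: softmax probability of the positive target.

    A trace that makes the query embedding point at the correct target
    (and away from distractors) earns a reward close to 1.
    """
    logits = [cosine(query_emb, c) / tau for c in candidate_embs]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    return exps[positive_idx] / sum(exps)


candidates = [[1.0, 0.0], [0.0, 1.0]]  # target embeddings; index 0 is the positive
trace_a = [0.9, 0.1]  # query embedding after a trace that points at the positive
trace_b = [0.2, 0.8]  # query embedding after a trace that drifts to a distractor
print(embedder_reward(trace_a, candidates, 0) > embedder_reward(trace_b, candidates, 0))  # True
```

In a policy-gradient setup, rewards like these would raise the likelihood of traces the Embedder finds useful for retrieval; which RL algorithm and reward EG-RL actually uses is not stated on this card.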
Good For
- Developing universal multimodal embeddings for diverse cross-modal tasks.
- Applications requiring fine-grained matching capabilities across modalities.
- Scenarios where targeted reasoning optimization can significantly improve multimodal embedding quality and generalization.