IEITYuan/Yuan-embedding-2.0-en

Params: 0.8B · Precision: BF16 · Context: 40960 · License: apache-2.0
Overview

Yuan-embedding-2.0-en is a 0.8-billion-parameter embedding model from IEITYuan, engineered for English text retrieval and reranking. It builds on Qwen/Qwen3-Embedding-0.6B and adds several key optimizations that improve its performance in semantic search applications.

Key Capabilities

  • Optimized for English Text Retrieval: Specifically designed to generate high-quality embeddings for English text, facilitating accurate semantic search.
  • Enhanced for Reranking Tasks: Beyond initial retrieval, the model is also fine-tuned to improve the ranking of search results.
  • Advanced Data Augmentation:
    • Hard Negative Sampling: Employs a dual evaluation process using a Rerank model and an LLM to filter high-quality positive and negative samples, improving model robustness.
    • LLM-Synthesized Data: Leverages the Yuan2-M32 model to rewrite query data within the training dataset, expanding and diversifying the training examples.
  • Sophisticated Loss Function Design: Combines a multi-task loss with Matryoshka Representation Learning, and uses InfoNCE with in-batch negatives for both the retrieval and reranking objectives, which is key to learning effective representations.
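To make the last bullet concrete, here is a minimal NumPy sketch of InfoNCE with in-batch negatives and of Matryoshka-style dimension truncation. This is an illustration of the general techniques named above, not the model's actual training code; the function names and the temperature value are placeholders.

```python
import numpy as np

def info_nce_in_batch(queries, docs, temperature=0.05):
    """InfoNCE loss with in-batch negatives.

    queries, docs: (B, D) arrays where docs[i] is the positive
    passage for queries[i]; every other row in the batch serves
    as a negative. Toy illustration only.
    """
    # L2-normalize so the dot product is cosine similarity.
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    d = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    sims = q @ d.T / temperature              # (B, B) similarity matrix
    # Softmax cross-entropy with the diagonal as the target class.
    sims -= sims.max(axis=1, keepdims=True)   # numerical stability
    log_probs = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

def matryoshka_truncate(emb, dim):
    """Matryoshka Representation Learning lets the first `dim`
    coordinates of a trained embedding serve as a smaller
    embedding after re-normalization."""
    sub = emb[..., :dim]
    return sub / np.linalg.norm(sub, axis=-1, keepdims=True)
```

The loss is low when each query is most similar to its own positive in the batch and high when a negative outranks it, which is the signal that shapes the embedding space during training.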

Good For

  • Generating embeddings for English text.
  • Improving the accuracy of semantic search systems.
  • Enhancing the relevance and order of retrieved documents through reranking.
  • Applications requiring robust text similarity and contextual understanding in English.
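As an illustration of the use cases above, here is a minimal embedding-based retrieval sketch. The `encode` function is a deliberately crude word-hashing placeholder so the example is self-contained and runnable; the commented lines show the assumed real loading path via the sentence-transformers library, which this card does not itself confirm.

```python
import hashlib
import numpy as np

# Toy stand-in encoder so this sketch runs without downloading the
# checkpoint. In real use you would load the model itself, e.g.
# (assuming the checkpoint is sentence-transformers compatible):
#
#   from sentence_transformers import SentenceTransformer
#   model = SentenceTransformer("IEITYuan/Yuan-embedding-2.0-en")
#   embeddings = model.encode(texts, normalize_embeddings=True)
def encode(texts, dim=512):
    """Hash each word into a bucket and L2-normalize (placeholder)."""
    out = np.zeros((len(texts), dim))
    for i, text in enumerate(texts):
        for word in text.lower().split():
            word = word.strip(".,?!")
            bucket = int(hashlib.md5(word.encode()).hexdigest(), 16) % dim
            out[i, bucket] += 1.0
    return out / np.linalg.norm(out, axis=1, keepdims=True)

def retrieve(query, corpus, top_k=2):
    """Rank passages by cosine similarity of their embeddings."""
    sims = (encode([query]) @ encode(corpus).T)[0]
    order = np.argsort(-sims)[:top_k]
    return [(corpus[i], float(sims[i])) for i in order]

corpus = [
    "The capital of France is Paris.",
    "Embedding models map text to dense vectors.",
    "Paris is famous for the Eiffel Tower.",
]
hits = retrieve("What is the capital of France?", corpus)
```

In a production pipeline, the candidates returned by `retrieve` would then be passed to the reranking stage to refine the final ordering, as described under Key Capabilities.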