IEITYuan/Yuan-embedding-2.0-zh
Yuan-embedding-2.0-zh is a 0.3 billion parameter embedding model developed by IEITYuan and optimized for Chinese text retrieval and reranking. Building on Yuan-embedding-1.0, it adds data augmentation in the form of hard negative sampling and LLM-synthesized data generated with Yuan2-M32. Training uses a multi-task loss function and Matryoshka Representation Learning, with InfoNCE and in-batch negatives for retrieval and a Margin-Adaptive Pairwise Ranking Loss for reranking. The model targets semantic search and document ranking in Chinese.
Yuan-embedding-2.0-zh Overview
Yuan-embedding-2.0-zh is a 0.3 billion parameter embedding model from IEITYuan, built for Chinese text retrieval and reranking. It is an optimized iteration of its predecessor, Yuan-embedding-1.0, with changes to data processing and loss function design.
Key Capabilities
- Specialized for Chinese Text: Optimized specifically for retrieval and reranking in Chinese language contexts.
- Enhanced Retrieval: Utilizes InfoNCE with in-batch negatives for robust retrieval performance.
- Optimized Reranking: Incorporates a Margin-Adaptive Pairwise Ranking Loss to improve the accuracy of document reranking.
- Advanced Data Augmentation: Uses hard negative sampling, in which both rerank models and LLMs select high-quality samples, together with LLM-synthesized data in which Yuan2-M32 rewrites queries.
- Multi-Task Learning: Trained with a multi-task loss function and Matryoshka Representation Learning, a technique that keeps embeddings usable when truncated to fewer dimensions.
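To make the two loss terms named above more concrete, here is a minimal sketch in plain Python. It is not the model's training code. The function names, the temperature value of 0.05, and the toy inputs are assumptions made for illustration. The card does not describe how the reranking margin adapts, so `pairwise_margin_loss` takes the margin as an ordinary argument and shows only the generic pairwise hinge form.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit L2 norm so dot products are cosine similarities."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def info_nce_in_batch(queries, passages, temperature=0.05):
    """Mean InfoNCE loss over a batch.

    passages[i] is the positive for queries[i]; every other passage in the
    batch acts as a negative. The temperature is an assumed value.
    """
    q = [normalize(v) for v in queries]
    p = [normalize(v) for v in passages]
    total = 0.0
    for i, qi in enumerate(q):
        logits = [dot(qi, pj) / temperature for pj in p]
        # Numerically stable log-sum-exp over all passages in the batch.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        # Negative log-probability of the positive passage.
        total += log_denom - logits[i]
    return total / len(q)


def pairwise_margin_loss(pos_score, neg_score, margin):
    """Generic pairwise hinge loss: zero once the positive beats the
    negative by at least `margin`. How the margin is chosen per pair is
    not specified in the card, so it is left to the caller here."""
    return max(0.0, margin - (pos_score - neg_score))


# Toy batch: each query is aligned with its own passage.
aligned_loss = info_nce_in_batch([[1.0, 0.0], [0.0, 1.0]],
                                 [[1.0, 0.0], [0.0, 1.0]])
# Same batch with positives swapped, so each query points at the wrong passage.
swapped_loss = info_nce_in_batch([[1.0, 0.0], [0.0, 1.0]],
                                 [[0.0, 1.0], [1.0, 0.0]])
```

In this toy batch the aligned pairing gives a loss close to zero and the swapped pairing gives a large one, which is the behavior in-batch negatives are meant to encourage.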
Good For
- Semantic Search Systems: Ideal for building or enhancing search engines that require understanding the semantic meaning of Chinese queries and documents.
- Information Retrieval: Excellent for tasks involving finding relevant Chinese documents from large corpora.
- Document Ranking: Highly effective for reordering search results or recommendations to present the most relevant Chinese content first.
- Chinese NLP Applications: Any application requiring high-quality, context-aware embeddings for Chinese text.
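For the search and ranking use cases above, the following sketch shows how embeddings are typically consumed once computed: documents are ranked by cosine similarity to the query, optionally after truncating each vector to its leading dimensions, which is the usual way Matryoshka-trained embeddings are shortened. The vectors are made-up toy values rather than model output, the helper names are hypothetical, and the card does not state which truncated sizes this model supports.

```python
import math


def truncate_and_normalize(vec, dim):
    """Keep the first `dim` components, then rescale to unit L2 norm."""
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]


def rank_by_cosine(query_vec, doc_vecs, dim):
    """Return document indices ordered from most to least similar."""
    q = truncate_and_normalize(query_vec, dim)
    scored = []
    for idx, doc in enumerate(doc_vecs):
        d = truncate_and_normalize(doc, dim)
        score = sum(a * b for a, b in zip(q, d))
        scored.append((score, idx))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored]


# Toy 4-dimensional embeddings standing in for real model output.
query = [0.9, 0.1, 0.3, 0.2]
docs = [
    [0.1, 0.9, 0.2, 0.4],
    [0.8, 0.2, 0.1, 0.3],
    [0.5, 0.5, 0.9, 0.1],
]

full_ranking = rank_by_cosine(query, docs, dim=4)   # all dimensions
short_ranking = rank_by_cosine(query, docs, dim=2)  # truncated prefix
print(full_ranking, short_ranking)
```

Shorter prefixes reduce storage and comparison cost at some loss of accuracy, so any truncated size is best checked against a held-out retrieval set before use.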