iampanda/zpoint_large_embedding_zh

Text Generation | Concurrency Cost: 1 | Model Size: 0.3B | Quant: BF16 | Ctx Length: 32k | Published: Jun 4, 2024 | License: mit | Architecture: Transformer | Open Weights

iampanda/zpoint_large_embedding_zh is a 0.3 billion parameter embedding model developed by iampanda and built upon the Stella base model. It targets Chinese-language retrieval, classification, and clustering, and is trained with extensive hard negative sampling and LLM-synthesized data. The model is optimized for generating high-quality embeddings for diverse Chinese text, including specialized domains such as healthcare and law.


ZPoint Large Embedding for Chinese

iampanda/zpoint_large_embedding_zh is a 0.3 billion parameter embedding model engineered for Chinese language processing. It is built upon the Stella base model, and its training combines hard negative sampling, LLM-synthesized data, and a multi-task loss with Matryoshka Representation Learning.

Key Training Details

  • Base Model: Utilizes the Stella model as its foundation.
  • Hard Negative Sampling: Employs rigorous hard negative sampling strategies for both retrieval (10 hard negatives per query) and classification/clustering tasks (5 hard negatives from other classes).
  • LLM Data Synthesis: Leverages a ZPoint-72B LLM to synthesize approximately 30 million additional training samples, including query rewriting and new query generation for documents.
  • Extensive Data Collection: Incorporates a wide array of public datasets such as miracl/miracl, FreedomIntelligence/Huatuo26M-Lite, and others, totaling around 100 million training samples.
  • Domain Diversity: The training dataset spans various specialized fields, including healthcare, law, electricity, automotive, and consumer electronics.
  • Training Loss: Uses a multi-task loss approach similar to Piccolo and incorporates Matryoshka Representation Learning.
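
The hard-negative and Matryoshka items above can be illustrated with a small sketch. It is not the author's training code: it shows only a retrieval-style InfoNCE term, whereas the source describes a multi-task loss similar to Piccolo. The function names, the temperature of 0.05, the prefix sizes, and the similarity-based mining rule are all assumptions made for illustration; only the count of 10 hard negatives per query comes from the text.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mine_hard_negatives(query, candidates, positive_ids, k=10):
    """Pick the k non-positive candidates most similar to the query.

    Assumption: similarity-ranked mining. The source states only the
    number of hard negatives, not how they were selected.
    """
    scored = [
        (cosine(query, vec), idx)
        for idx, vec in enumerate(candidates)
        if idx not in positive_ids
    ]
    scored.sort(reverse=True)
    return [idx for _, idx in scored[:k]]


def info_nce(query, positive, hard_negatives, temperature=0.05):
    """Contrastive loss for one query: the positive competes with the negatives."""
    logits = [cosine(query, positive) / temperature]
    logits += [cosine(query, neg) / temperature for neg in hard_negatives]
    # Numerically stable log-sum-exp over all candidates.
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
    # Negative log-probability assigned to the positive.
    return log_sum - logits[0]


def matryoshka_loss(query, positive, hard_negatives, dims=(2, 4)):
    """Sum the contrastive loss over nested prefixes of the embedding.

    Assumption: the prefix sizes in `dims` are illustrative only.
    """
    total = 0.0
    for d in dims:
        total += info_nce(
            query[:d], positive[:d], [neg[:d] for neg in hard_negatives]
        )
    return total
```

Summing the loss over nested prefixes is what allows a shorter leading slice of the embedding to remain usable by itself, at some cost in accuracy.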

Use Cases

This model is particularly well-suited for applications requiring robust Chinese text embeddings, such as:

  • Information Retrieval: Enhancing search relevance and document matching.
  • Text Classification: Categorizing Chinese documents or queries.
  • Text Clustering: Grouping similar Chinese texts together.
  • Semantic Search: Finding semantically similar content in Chinese corpora.
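
A minimal retrieval sketch for the use cases above, assuming the checkpoint loads through the sentence-transformers library under the model ID given at the top of this page. The helper names embed, rank_by_similarity, and search are hypothetical, and ranking by cosine similarity is a common choice rather than something the source prescribes.

```python
import math

MODEL_ID = "iampanda/zpoint_large_embedding_zh"


def embed(texts, model_id=MODEL_ID):
    """Encode texts into unit-length vectors (downloads the model on first use)."""
    # Imported lazily so the ranking helper below works without the dependency.
    # Assumption: the checkpoint is loadable via SentenceTransformer.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_id)
    return model.encode(texts, normalize_embeddings=True).tolist()


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_by_similarity(query_vec, doc_vecs):
    """Return document indices ordered from most to least similar."""
    scores = [cosine(query_vec, doc) for doc in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)


def search(query, docs):
    """Embed the query and documents together, then return docs by relevance."""
    vectors = embed(docs + [query])
    order = rank_by_similarity(vectors[-1], vectors[:-1])
    return [docs[i] for i in order]
```

Calling search("感冒用药建议", docs) with a list of Chinese passages would return those passages ordered by similarity to the query. For classification or clustering, the vectors from embed can be passed to any standard classifier or clustering routine.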