Overview
The gemma-2-mitra-embedding model, developed by buddhist-nlp, is a specialized multilingual sentence embedding model built upon the Gemma 2 architecture. It functions as an encoder, transforming input text into L2-normalized embeddings, primarily for semantic similarity and retrieval tasks. A key feature is its design for asymmetric inputs, requiring a specific instruction-based format for queries and raw text for corpus passages.
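The asymmetric query/corpus format can be sketched as a pair of small helpers. This is a hypothetical illustration: the source states only that queries use an instruction-based template with `<instruct>` and `<query>` tokens while corpus passages are raw text, so the exact template layout below (instruction and query on separate tagged lines) is an assumption, not the model's documented format.

```python
# Hypothetical helpers illustrating the asymmetric input convention.
# ASSUMPTION: the concrete template layout is not specified in this card;
# only the <instruct> / <query> tokens and the raw-text corpus side are.

def format_query(task_instruction: str, query: str) -> str:
    # Query side: instruction-aware, wrapped in the special tokens.
    return f"<instruct>{task_instruction}\n<query>{query}"

def format_passage(passage: str) -> str:
    # Corpus side: passed through as raw text, no template.
    return passage
```

Only the query side changes with the task; the corpus can be encoded once and reused across different instructions.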
Key Capabilities
- Multilingual Semantic Similarity: Excels at comparing sentences across languages including Sanskrit, Tibetan, Pali, Chinese, and English.
- Retrieval Systems: Optimized for nearest-neighbor search by encoding queries with an `<instruct>` template and corpus passages as raw text.
- Cross-Lingual Alignment: Specifically used within the Mitra alignment stack for sentence-level alignment of Buddhist texts.
- Instruction-Aware Embeddings: Utilizes special tokens (`<instruct>`, `<query>`) to generate context-specific embeddings for queries.
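Because the model emits L2-normalized embeddings, nearest-neighbor search reduces to a dot product (which equals cosine similarity for unit-norm vectors). The sketch below uses random stand-in vectors rather than real model outputs, purely to show the retrieval arithmetic.

```python
import numpy as np

# Toy stand-ins for model outputs: random vectors, L2-normalized as the
# model's embeddings are. With unit-norm vectors, dot product == cosine
# similarity, so corpus search is a single matrix-vector multiply.
rng = np.random.default_rng(0)
corpus_emb = rng.normal(size=(4, 8))
corpus_emb /= np.linalg.norm(corpus_emb, axis=1, keepdims=True)

# Simulate a query embedding close to corpus passage 2.
query_emb = corpus_emb[2] + 0.01 * rng.normal(size=8)
query_emb /= np.linalg.norm(query_emb)

scores = corpus_emb @ query_emb   # cosine similarities, one per passage
best = int(np.argmax(scores))     # index of the nearest passage
```

In a real pipeline the same ranking step applies unchanged; only the embeddings would come from the model instead of `rng`.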
Good For
- Buddhist NLP Research: Ideal for tasks involving ancient and modern Buddhist texts in various languages.
- Multilingual RAG/Search: Applications requiring robust multilingual, instruction-aware query/corpus embeddings.
- Custom Alignment Pipelines: Integration into systems like Bertalign for enhanced sentence alignment.
- Any application needing L2-normalized sentence vectors for semantic comparison in the supported languages.