buddhist-nlp/gemma-2-mitra-e

Text Generation | Concurrency Cost: 1 | Model Size: 9B | Quant: FP8 | Ctx Length: 16k | Published: May 23, 2025 | Architecture: Transformer | 0.0K | Cold

buddhist-nlp/gemma-2-mitra-e is a multilingual sentence embedding model based on the Gemma 2 architecture, developed by buddhist-nlp. It is designed for semantic similarity and retrieval tasks, and is optimized for cross-lingual sentence alignment in languages such as Sanskrit, Tibetan, Pali, Chinese, and English. The model functions as an encoder that converts input text into L2-normalized embeddings for comparison and search.

Overview

The gemma-2-mitra-embedding model, developed by buddhist-nlp, is a specialized multilingual sentence embedding model built upon the Gemma 2 architecture. It functions as an encoder, transforming input text into L2-normalized embeddings, primarily for semantic similarity and retrieval tasks. It is designed for asymmetric inputs: queries must follow a specific instruction-based format, while corpus passages are passed as raw text.
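Because the embeddings are L2-normalized, the dot product of two embeddings equals their cosine similarity, which keeps comparison cheap. The following is a minimal sketch of that property using toy 3-dimensional vectors in place of real model outputs; the helper names `l2_normalize` and `dot` are illustrative and not part of the model's API.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit Euclidean (L2) length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


# Toy vectors standing in for embeddings (assumption: real embeddings
# are much higher-dimensional and already normalized by the model).
a = l2_normalize([3.0, 4.0, 0.0])
b = l2_normalize([4.0, 3.0, 0.0])

# For unit-length vectors the dot product is the cosine similarity.
similarity = dot(a, b)
```

With unit-length vectors, a score near 1 indicates closely aligned embeddings and a score near 0 indicates unrelated ones.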

Key Capabilities

  • Multilingual Semantic Similarity: Excels at comparing sentences across languages including Sanskrit, Tibetan, Pali, Chinese, and English.
  • Retrieval Systems: Optimized for nearest-neighbor search by encoding queries with an <instruct> template and corpus passages as raw text.
  • Cross-Lingual Alignment: Specifically used within the Mitra alignment stack for sentence-level alignment of Buddhist texts.
  • Instruction-Aware Embeddings: Utilizes special tokens (<instruct>, <query>) to generate context-specific embeddings for queries.
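The asymmetric setup described in the list above can be sketched roughly as follows. The exact query template is not given here, so the layout produced by `format_query` is a hypothetical placeholder built only from the `<instruct>` and `<query>` tokens named above; the real template should be taken from the model's own documentation. The toy unit vectors stand in for actual embeddings, and `top_k` is an illustrative brute-force search, not the Mitra stack's implementation.

```python
def format_query(instruction, text):
    # HYPOTHETICAL layout: only the <instruct> and <query> tokens come
    # from the model description; their exact arrangement is assumed.
    return f"<instruct>{instruction}\n<query>{text}"


def format_passage(text):
    # Corpus passages are encoded as raw text, with no template.
    return text


def top_k(query_vec, corpus_vecs, k=2):
    """Brute-force nearest-neighbor search over L2-normalized vectors.

    Returns (corpus_index, score) pairs sorted by descending dot
    product, which equals cosine similarity for unit-length vectors.
    """
    scored = [
        (sum(q * c for q, c in zip(query_vec, vec)), idx)
        for idx, vec in enumerate(corpus_vecs)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(idx, score) for score, idx in scored[:k]]


# Toy unit-length vectors standing in for real embeddings (assumption).
corpus_vecs = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
query_vec = [0.8, 0.6]

query_text = format_query("Retrieve a parallel sentence.", "dharma")
hits = top_k(query_vec, corpus_vecs, k=2)
```

In a real pipeline, `query_text` and the raw passages would be passed to the encoder, and the resulting vectors would replace the toy ones. For large corpora an approximate nearest-neighbor index would usually replace the brute-force loop.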

Good For

  • Buddhist NLP Research: Ideal for tasks involving ancient and modern Buddhist texts in various languages.
  • Multilingual RAG/Search: Applications requiring robust multilingual, instruction-aware query/corpus embeddings.
  • Custom Alignment Pipelines: Integration into systems like Bertalign for enhanced sentence alignment.
  • Any application needing L2-normalized sentence vectors for semantic comparison in the supported languages.