Stella-PL: Bilingual Polish-English Text Encoder
Stella-PL is a 1.5-billion-parameter text encoder developed by sdadas and based on stella_en_1.5B_v5. It was adapted to Polish through multilingual knowledge distillation on a corpus of 20 million Polish-English text pairs. The model produces 1024-dimensional embeddings for both languages, so texts can be compared semantically within Polish, within English, or across the two.
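Semantic comparison between two embeddings is usually done with cosine similarity. The sketch below is a minimal illustration, not the model's own code: it uses hand-written 4-dimensional toy vectors as stand-ins for real 1024-dimensional embeddings, and the variable names are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Toy stand-ins for real 1024-dimensional embeddings (hypothetical values).
polish_vec = [0.9, 0.1, 0.0, 0.4]     # a Polish sentence
english_vec = [0.8, 0.2, 0.1, 0.5]    # its English translation
unrelated_vec = [0.0, 0.9, 0.7, 0.0]  # an unrelated sentence

same_meaning = cosine_similarity(polish_vec, english_vec)
different_meaning = cosine_similarity(polish_vec, unrelated_vec)
print(same_meaning > different_meaning)
```

In a shared embedding space, a sentence and its translation should score close to 1, while unrelated texts should score noticeably lower.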
Key Capabilities
- Bilingual Encoding: Processes both Polish and English texts into a unified embedding space.
- Cross-Lingual Semantic Search: Enables retrieval and similarity comparisons between Polish and English content.
- High Performance: Achieves an NDCG@10 of 60.52 on the Polish Information Retrieval Benchmark (PIRB).
- Optimized for Retrieval: Uses task-specific instruction prefixes for retrieval and semantic similarity, as in the original Stella model.
- Efficient Processing: Supports Flash Attention 2 for faster inference.
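For readers unfamiliar with the metric quoted above, NDCG@10 can be sketched as follows. This is a generic illustration with linear gain, not PIRB's evaluation code. The function and parameter names are hypothetical, and the score of 60.52 is presumably a mean over queries expressed on a 0-100 scale.

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the first k results (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, all_relevances=None, k=10):
    """DCG of the ranking divided by the DCG of an ideal ranking.

    ranked_relevances: relevance grades of the returned results, in rank order.
    all_relevances: grades of every judged document for the query; if omitted,
    the ideal ranking is built from the returned results only (a simplification).
    """
    pool = ranked_relevances if all_relevances is None else all_relevances
    ideal = dcg_at_k(sorted(pool, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0

# Relevant documents were returned at ranks 1 and 3.
score = ndcg_at_k([1, 0, 1, 0, 0])
print(round(score, 4))
```

A ranking that places every relevant document first scores 1.0. Relevant documents pushed further down the list are discounted logarithmically by rank.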
Good For
- Information Retrieval: Ideal for search applications requiring relevant passage retrieval in Polish or English, or cross-lingual search.
- Semantic Similarity: Suitable for tasks like identifying semantically similar sentences or documents in either language.
- Cross-Lingual Applications: Suited to systems that need to understand and compare meaning across Polish and English.
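A cross-lingual search built on such an encoder might look like the sketch below. Several things here are assumptions rather than facts from this description: the prefix string is modeled on the one published for the original Stella model and should be checked against the Stella-PL model card, passages are assumed to need no prefix, and `encode` is a hypothetical callable that maps a list of strings to a list of vectors. A stub with toy vectors stands in for the real model.

```python
import math

# Assumed retrieval prefix, modeled on the original Stella model; verify before use.
RETRIEVAL_PREFIX = (
    "Instruct: Given a web search query, retrieve relevant passages "
    "that answer the query.\nQuery: "
)

def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def search(query, passages, encode, k=3):
    """Return the k best (score, passage) pairs for a query."""
    # Only the query carries the instruction prefix (assumption).
    query_vec = normalize(encode([RETRIEVAL_PREFIX + query])[0])
    passage_vecs = [normalize(v) for v in encode(passages)]
    scored = [
        (sum(q * p for q, p in zip(query_vec, vec)), passage)
        for vec, passage in zip(passage_vecs, passages)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

# Stub encoder: hand-written toy vectors standing in for real model output.
TOY_VECTORS = {
    RETRIEVAL_PREFIX + "stolica Polski": [0.9, 0.1, 0.1],
    "Warsaw is the capital of Poland.": [0.8, 0.2, 0.1],
    "Kraków leży nad Wisłą.": [0.4, 0.8, 0.1],
    "Flash Attention speeds up inference.": [0.0, 0.1, 0.9],
}

def toy_encode(texts):
    return [TOY_VECTORS[text] for text in texts]

passages = [
    "Flash Attention speeds up inference.",
    "Kraków leży nad Wisłą.",
    "Warsaw is the capital of Poland.",
]
results = search("stolica Polski", passages, toy_encode, k=2)
for score, passage in results:
    print(f"{score:.3f}  {passage}")
```

Replacing `toy_encode` with a function that calls the real model leaves the ranking logic unchanged. With the toy vectors above, the Polish query ranks the English passage about Warsaw first.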