sdadas/stella-pl-retrieval
Text Generation | Concurrency Cost: 1 | Model Size: 1.5B | Quant: BF16 | Ctx Length: 32k | License: gemma | Architecture: Transformer

sdadas/stella-pl-retrieval is a 1.5 billion parameter text encoder developed by sdadas, based on stella_en_1.5B_v5 and fine-tuned for Polish information retrieval. It was adapted to Polish through multilingual knowledge distillation and then fine-tuned with a contrastive loss on 1.4 million queries. The model maps texts to 1024-dimensional vectors and achieves an NDCG@10 of 62.32 on the Polish Information Retrieval Benchmark (PIRB).

Overview

sdadas/stella-pl-retrieval is a 1.5 billion parameter text encoder derived from stella_en_1.5B_v5 and fine-tuned for Polish information retrieval (IR). sdadas trained it in two steps:

  1. Multilingual Adaptation: The base model was first adapted to Polish by knowledge distillation on a corpus of 20 million Polish-English text pairs.
  2. Contrastive Fine-tuning: The adapted model was then fine-tuned with a contrastive loss on 1.4 million queries, with the positive and negative passages for each query selected by the BAAI/bge-reranker-v2.5-gemma2-lightweight reranker.
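
The source does not say which contrastive objective was used in step 2. As a rough illustration only, the sketch below assumes an InfoNCE-style loss over one positive and several reranker-selected negatives per query. The function names, the temperature of 0.05, and the use of NumPy are assumptions made for this example, not the author's training code.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    norms = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norms, eps)


def contrastive_loss(query, positive, negatives, temperature=0.05):
    """InfoNCE-style loss for a single query (assumed objective, see lead-in).

    query:     shape (d,)
    positive:  shape (d,)   passage labeled relevant
    negatives: shape (n, d) passages labeled irrelevant
    """
    q = l2_normalize(np.asarray(query, dtype=np.float64))
    candidates = np.vstack([np.asarray(positive)[None, :], np.asarray(negatives)])
    candidates = l2_normalize(candidates.astype(np.float64))

    # Cosine similarity of the query to every candidate, sharpened by temperature.
    logits = candidates @ q / temperature
    logits = logits - logits.max()  # subtract the max for numerical stability

    # Negative log-probability that the positive (index 0) wins the softmax.
    log_probs = logits - np.log(np.exp(logits).sum())
    return float(-log_probs[0])
```

The loss approaches zero when the query is much closer to the positive than to any negative, and grows when a negative outscores the positive, which is the pressure that pulls relevant passages toward their queries in embedding space.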

The model encodes texts into 1024-dimensional vectors and is designed to retrieve relevant passages for Polish queries.
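
To make the retrieval step concrete, here is a minimal sketch of ranking passages by cosine similarity once texts have been turned into 1024-dimensional vectors. Random vectors stand in for real embeddings, and `rank_passages` is a hypothetical helper written for this example; cosine similarity is assumed as the scoring function.

```python
import numpy as np

EMBEDDING_DIM = 1024  # vector size stated for this model


def rank_passages(query_vec, passage_vecs, top_k=3):
    """Return (passage_index, cosine_score) pairs, best match first."""
    q = query_vec / np.linalg.norm(query_vec)
    p = passage_vecs / np.linalg.norm(passage_vecs, axis=1, keepdims=True)
    scores = p @ q
    order = np.argsort(-scores)[:top_k]
    return [(int(i), float(scores[i])) for i in order]


# Stand-in data: five random "passage" vectors and a query that is a
# slightly perturbed copy of passage 2, so passage 2 should rank first.
rng = np.random.default_rng(0)
passages = rng.normal(size=(5, EMBEDDING_DIM))
query = passages[2] + 0.1 * rng.normal(size=EMBEDDING_DIM)

ranking = rank_passages(query, passages)
print(ranking)
```

For large collections the same scoring is usually delegated to a vector index rather than a dense matrix product, but the ordering logic is unchanged.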

Key Capabilities

  • Specialized Polish IR: Optimized for information retrieval in the Polish language.
  • High Performance: Achieves an NDCG@10 of 62.32 on the Polish Information Retrieval Benchmark (PIRB).
  • Prompt-based Usage: Uses the same task-specific prompts as the original stella_en_1.5B_v5, one for retrieval queries and one for symmetric tasks such as semantic similarity.
  • Efficient Encoding: Transforms texts into 1024-dimensional embeddings.
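
For readers unfamiliar with the NDCG@10 metric quoted above, the following sketch shows one common way to compute it for a single query. It assumes linear gain and a log2 position discount; the source does not state which variant PIRB uses, and the function names are illustrative. Benchmark tables often report the value multiplied by 100 and averaged over tasks, which would be consistent with a figure like 62.32, though the source does not spell this out.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the first k results (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, k=10, all_relevances=None):
    """NDCG@k for one query.

    ranked_relevances: relevance labels of the returned results, in rank order.
    all_relevances:    labels of every judged passage for the query; if omitted,
                       the ideal ranking is built from the returned results only.
    """
    pool = ranked_relevances if all_relevances is None else all_relevances
    ideal = dcg_at_k(sorted(pool, reverse=True), k)
    if ideal == 0:
        return 0.0
    return dcg_at_k(ranked_relevances, k) / ideal


# A single relevant passage returned at rank 2 instead of rank 1.
print(round(ndcg_at_k([0, 1]), 4))
```

A score of 1.0 means every relevant passage was ranked ahead of every irrelevant one within the top 10.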

Good For

  • Building Polish Search Engines: Ideal for applications requiring precise document or passage retrieval for Polish queries.
  • Semantic Search in Polish: Supports semantic similarity between Polish texts, though the more general-purpose sdadas/stella-pl encoder may be a better fit for broader semantic tasks.
  • Research in Polish NLP: Provides a strong baseline for further research and development in Polish information retrieval.
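
For the search and similarity use cases above, the prompt convention matters: in retrieval only the query is prefixed, while in symmetric tasks both texts are. The sketch below shows how this might look with the sentence-transformers library. The two prefix strings are recalled from the upstream stella_en_1.5B_v5 documentation and are not given in this text, so treat them as assumptions and verify them against the model card; the helper names and the `trust_remote_code=True` setting are likewise assumptions.

```python
import numpy as np

# ASSUMPTION: prefixes as recalled from the stella_en_1.5B_v5 card; verify before use.
RETRIEVAL_PREFIX = (
    "Instruct: Given a web search query, retrieve relevant passages "
    "that answer the query.\nQuery: "
)
SYMMETRIC_PREFIX = "Instruct: Retrieve semantically similar text.\nQuery: "


def load_encoder(model_name="sdadas/stella-pl-retrieval"):
    """Load the model; imported lazily so the helpers work without the dependency."""
    from sentence_transformers import SentenceTransformer

    # ASSUMPTION: custom model code is required, as with the base model.
    return SentenceTransformer(model_name, trust_remote_code=True)


def encode_queries(encoder, queries):
    """Retrieval queries get the retrieval prefix."""
    return np.asarray(encoder.encode([RETRIEVAL_PREFIX + q for q in queries]))


def encode_passages(encoder, passages):
    """Passages in a retrieval setup are encoded as-is, without a prefix."""
    return np.asarray(encoder.encode(list(passages)))


def encode_symmetric(encoder, texts):
    """For similarity tasks, every text gets the symmetric prefix."""
    return np.asarray(encoder.encode([SYMMETRIC_PREFIX + t for t in texts]))
```

A typical flow would call `load_encoder()` once, index the output of `encode_passages` offline, and call `encode_queries` at search time. Any object exposing `encode(list_of_strings)` can be passed as `encoder`, which keeps the helpers easy to test without downloading the model.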