szymonrucinski/Curie-7B-v1

Text generation · Model size: 7B · Quantization: FP8 · Context length: 8k · Published: Jan 11, 2024 · License: apache-2.0 · Architecture: Transformer · Concurrency cost: 1 · Open weights

Curie-7B-v1 is a 7 billion parameter decoder-based large language model developed by Szymon Ruciński and fine-tuned specifically for Polish text generation. It uses Language Adaptive Pre-training (LAPT) on a high-quality Polish dataset, achieving a perplexity of 3.02 and rivaling the best Polish encoder-decoder models on the KLEJ benchmark. The model excels at generating Polish text and can be efficiently adapted to a range of NLP tasks, including classification and regression.


Curie-7B-v1: Efficient Polish Language Model

Curie-7B-v1 is a 7 billion parameter decoder-based LLM developed by Szymon Ruciński, showcasing the effectiveness of adapting English LLMs to Polish. It achieves strong performance through Language Adaptive Pre-training (LAPT) on a high-quality 3.11 GB Polish corpus (276 million tokens), followed by task-specific fine-tuning on the KLEJ benchmark.
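
The model can be loaded with the standard Hugging Face Transformers API. Below is a minimal sketch; the prompt and sampling values are illustrative, not a recommended configuration.

```python
# Minimal sketch: loading Curie-7B-v1 with Hugging Face Transformers and
# generating Polish text. The prompt and sampling values are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "szymonrucinski/Curie-7B-v1"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # half precision keeps a 7B model on a single GPU
    device_map="auto",
)

prompt = "Historia Polski zaczyna się"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=128,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```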

Key Capabilities & Performance

  • Lowest Perplexity: Achieves a perplexity of 3.02 for Polish text generation among decoder-based models.
  • Efficient Training: Rivals the best Polish encoder-decoder models on 8 out of 9 KLEJ tasks, using only 2-3% of the typical dataset size.
  • Versatile Adaptation: Can be transformed into classifiers, regressors, and AI assistants for various Polish NLP tasks (a classification sketch follows this list).
  • Benchmark Highlights (KLEJ tasks):
    • NKJP-NER: 93.4
    • CDSC-E: 92.2
    • CDSC-R: 94.9
    • PolEmo2.0-IN: 92.7
    • PSC: 98.6
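
As a rough illustration of the adaptation claim above, the checkpoint can be wrapped with a sequence-classification head, assuming its decoder architecture is one that Transformers exposes via AutoModelForSequenceClassification (most 7B decoder families are). The label count and the downstream training loop are placeholders, not values from the paper.

```python
# Hedged sketch: adding a sequence-classification head on top of the decoder
# for a Polish classification task (e.g. a KLEJ-style dataset). The number of
# labels and the downstream training loop are placeholders, not paper values.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "szymonrucinski/Curie-7B-v1"

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token  # decoder-only models usually lack a pad token

model = AutoModelForSequenceClassification.from_pretrained(
    model_id,
    num_labels=3,  # placeholder: e.g. three sentiment classes
)
model.config.pad_token_id = tokenizer.pad_token_id

# From here, fine-tune with the Trainer API or a custom loop on a Polish
# classification dataset; only the new classification head starts untrained.
```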

Training Details

The LAPT phase utilized a 2 GB high-quality extract from the SpeakLeash dataset. Training ran for a single epoch over 106 hours on an NVIDIA RTX A6000 Ada GPU. Further details are available in the research paper: Efficient Language Adaptive Pre-training: Extending State-of-the-Art Large Language Models for Polish.
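
For readers unfamiliar with LAPT, it amounts to continued causal-language-model training of an English base model on a Polish corpus. The sketch below is illustrative only: the base-model placeholder, corpus file, and hyperparameters are assumptions and do not come from the paper; only the single training epoch is taken from this card.

```python
# Illustrative sketch of language-adaptive pre-training (LAPT): continued
# causal-LM training of an English base model on a Polish corpus. The base
# model placeholder, corpus file, and hyperparameters are assumptions for
# illustration; only the single training epoch comes from this card.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

base_model_id = "<english-base-llm>"  # placeholder for the English base model

tokenizer = AutoTokenizer.from_pretrained(base_model_id)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base_model_id)

# Hypothetical local corpus of cleaned Polish text, one document per line.
raw = load_dataset("text", data_files={"train": "polish_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

tokenized = raw["train"].map(tokenize, batched=True, remove_columns=["text"])
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

args = TrainingArguments(
    output_dir="curie-lapt",
    num_train_epochs=1,             # the card reports a single epoch
    per_device_train_batch_size=1,  # placeholder; real batch size is not stated here
    gradient_accumulation_steps=16,
    learning_rate=2e-5,
    bf16=True,
    logging_steps=100,
)

Trainer(model=model, args=args, train_dataset=tokenized, data_collator=collator).train()
```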

Good For

  • Generating high-quality Polish text.
  • Developing efficient business solutions requiring Polish NLP capabilities.
  • Research and development in Polish language modeling with limited data resources.

Popular Sampler Settings

The top parameter combinations used by Featherless users for this model adjust the following sampler settings: temperature, top_p, top_k, frequency_penalty, presence_penalty, repetition_penalty, and min_p.
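
When running the model locally with Transformers, most of these settings map directly onto generate() keyword arguments; frequency_penalty and presence_penalty are OpenAI-style API parameters with no direct generate() equivalent. The values below are placeholders, not the actual user configurations, and the snippet reuses model, tokenizer, and inputs from the loading example above.

```python
# Illustrative mapping of the sampler settings above onto Transformers
# generation kwargs, reusing `model`, `tokenizer`, and `inputs` from the
# loading example. The numeric values are placeholders, not the actual
# Featherless user configurations.
outputs = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.8,         # sharpens or flattens the token distribution
    top_p=0.95,              # nucleus sampling
    top_k=50,                # sample only from the k most likely tokens
    repetition_penalty=1.1,  # discourage verbatim repetition
    min_p=0.05,              # requires a recent transformers release
    max_new_tokens=256,
)
# frequency_penalty and presence_penalty are OpenAI-style API parameters;
# they are set on the serving request rather than passed to generate().
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```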