Bielik-7B-v0.1: A Polish Language Model
Bielik-7B-v0.1 is a 7-billion-parameter generative text model created as a collaboration between the open-science project SpeakLeash and the High Performance Computing (HPC) center ACK Cyfronet AGH. It was developed through continued pre-training of the Mistral-7B-v0.1 model on over 70 billion tokens of carefully selected and processed Polish text.
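For orientation, below is a minimal sketch of loading the model for plain text completion with the Hugging Face transformers library. The repository id speakleash/Bielik-7B-v0.1 and the generation settings are assumptions for illustration, not an official quickstart.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "speakleash/Bielik-7B-v0.1"  # assumed Hugging Face repository id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # half precision to reduce memory footprint
    device_map="auto",
)

# This is a base (non-instruct) model: prompt it as text continuation, not chat.
prompt = "Najdłuższa rzeka w Polsce to"  # "The longest river in Poland is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=30, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```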
Key Capabilities & Features
- Polish Language Expertise: Understands and generates Polish with high accuracy across a wide range of linguistic tasks.
- Base Model: Intended for further fine-tuning for specific use cases; a separate instruction-tuned version, Bielik-7B-Instruct-v0.1, is available for chat and instruction-following.
- Optimized Training: Trained on the Helios supercomputer using 256 NVIDIA GH200 accelerators, achieving high training throughput with the ALLaMo framework for efficient language model training.
- High-Quality Data: Utilized a training dataset of over 36 billion Polish tokens, filtered for quality with an XGBoost classifier built on 93 linguistic features (a simplified sketch of this filtering step follows this list).
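The sketch below illustrates the quality-filtering idea described above: an XGBoost classifier scores each document from a vector of linguistic features, and only high-scoring documents are kept. The feature values, labels, and acceptance threshold here are hypothetical; only the "XGBoost over 93 features" setup comes from the model description.

```python
import numpy as np
from xgboost import XGBClassifier

N_FEATURES = 93  # number of linguistic features per document (from the description)

# Hypothetical labeled sample: rows are documents, columns are linguistic
# features (e.g. punctuation ratios, average sentence length, OOV rate, ...).
rng = np.random.default_rng(0)
X_train = rng.random((1000, N_FEATURES))
y_train = rng.integers(0, 2, size=1000)  # 1 = high quality, 0 = low quality

clf = XGBClassifier(n_estimators=200, max_depth=6, eval_metric="logloss")
clf.fit(X_train, y_train)

# Filter a corpus: keep only documents the classifier deems high quality.
X_corpus = rng.random((10, N_FEATURES))
quality = clf.predict_proba(X_corpus)[:, 1]
keep = quality >= 0.9  # hypothetical acceptance threshold
print(f"kept {keep.sum()} of {len(keep)} documents")
```

Gradient-boosted trees are a natural fit for this kind of tabular feature scoring, since a trained model is cheap enough to apply across billions of documents.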
Performance Highlights
Evaluated on the Open PL LLM Leaderboard (5-shot), Bielik-7B-v0.1 demonstrates strong performance, particularly in RAG-related tasks:
- RAG Reranking: Achieved 62.13, outperforming other 7B models such as its base model, Mistral-7B-v0.1.
- RAG Reader: Scored 88.39, a top result in open-book question answering among comparable models (an illustrative prompt sketch follows this list).
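To make the "RAG Reader" setting concrete, here is an illustrative open-book QA prompt, reusing the `tokenizer` and `model` from the loading sketch above. The prompt template is a hypothetical example and is not the Open PL LLM Leaderboard's actual evaluation format.

```python
# RAG Reader sketch: answer a question given a retrieved passage.
passage = (
    "Bielik to gatunek dużego ptaka drapieżnego z rodziny jastrzębiowatych. "
    "Jest herbowym ptakiem Polski."
)
question = "Z jakiej rodziny pochodzi bielik?"

# Hypothetical context/question/answer template (Polish: "Kontekst",
# "Pytanie", "Odpowiedź" = "Context", "Question", "Answer").
prompt = f"Kontekst: {passage}\nPytanie: {question}\nOdpowiedź:"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20, do_sample=False)
# Decode only the newly generated tokens, i.e. the model's answer.
answer = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
print(answer)
```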
When to Use This Model
- Polish Language Applications: Ideal for developers building applications requiring deep understanding and generation of Polish text.
- Further Fine-tuning: Best suited as a base model for custom fine-tuning to specific downstream tasks or domains (a minimal LoRA sketch follows this list).
- Research & Development: Valuable for researchers exploring large language models with a focus on Polish linguistic nuances and performance.
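As a starting point for the fine-tuning use case above, here is a minimal sketch of adapting the base model with LoRA via the peft library. The dataset path, hyperparameters, and target modules are illustrative assumptions, not recommendations from the Bielik authors.

```python
import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_id = "speakleash/Bielik-7B-v0.1"  # assumed Hugging Face repository id
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token  # Mistral-style tokenizers lack a pad token
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Attach low-rank adapters to the attention projections (Mistral-style names).
lora = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
                  task_type="CAUSAL_LM")
model = get_peft_model(model, lora)

# Hypothetical Polish text file; replace with your own domain corpus.
dataset = load_dataset("text", data_files={"train": "polish_corpus.txt"})["train"]
dataset = dataset.map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=512),
    batched=True,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="bielik-lora",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        num_train_epochs=1,
        bf16=True,
    ),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

LoRA keeps the 7B base weights frozen and trains only small adapter matrices, which makes domain adaptation feasible on a single GPU; full fine-tuning remains an option where more compute is available.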