PLLuM-12B-base: A Polish-Centric Language Model
CYFRAGOVPL/PLLuM-12B-base is a 12-billion-parameter base model from the PLLuM (Polish Large Language Model) family. Developed by a consortium of leading Polish scientific institutions, it builds on the Mistral-Nemo-Base-2407 architecture and is optimized for Polish and other Slavic and Baltic languages, while also incorporating English data for better generalization.
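Below is a minimal loading sketch using the Hugging Face transformers library. It assumes the checkpoint is published on the Hub under the CYFRAGOVPL/PLLuM-12B-base identifier and exposes the standard causal-LM interface (plausible given its Mistral-Nemo lineage); the dtype and device settings are illustrative, not prescribed by this card.

```python
# Minimal loading sketch (assumption: the checkpoint lives on the Hugging Face
# Hub as CYFRAGOVPL/PLLuM-12B-base and follows the standard causal-LM layout).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "CYFRAGOVPL/PLLuM-12B-base"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # bf16 roughly halves memory vs. fp32 for a 12B model
    device_map="auto",           # requires `accelerate`; shards across available devices
)
```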
Key Capabilities
- Specialized Language Focus: Pre-trained on up to 150 billion tokens of Polish text, supplemented with Slavic, Baltic, and English data, which gives it strong proficiency across these languages.
- High-Quality Training Data: Trained on a large-scale, high-quality text corpus; the project also assembled a unique collection of ~40k manually written Polish "organic instructions" for fine-tuning.
- Robust Alignment: Benefits from the first Polish-language preference corpus, used to align the model toward factual and linguistic correctness, balance, and safe handling of sensitive topics.
- Strong Performance: Achieves top scores on custom benchmarks for Polish public administration tasks and state-of-the-art results in broader Polish-language evaluations.
Good For
- General Language Tasks: Ideal for text generation, summarization, and question answering in Polish and related languages (see the generation sketch after this list).
- Domain-Specific Applications: Particularly effective for developing intelligent assistants and applications in Polish public administration, legal, and bureaucratic sectors.
- Research and Development: Serves as a robust foundation for building downstream AI applications requiring strong command of the Polish language.
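As a rough illustration of the general-language and administrative use cases above, here is a generation sketch that continues from the loading snippet earlier in this card (it reuses `tokenizer` and `model`). Because this is a base model, it performs plain text continuation rather than following chat-formatted instructions; the prompt and sampling parameters are illustrative.

```python
# Completion-style prompting: a *base* model continues text; it is not tuned
# to follow chat-formatted instructions. Reuses `tokenizer`/`model` from the
# loading sketch above. Prompt translation: "An application for the issuance
# of an identity card should contain".
prompt = "Wniosek o wydanie dowodu osobistego powinien zawierać"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=64,   # illustrative budget for the continuation
    do_sample=True,
    temperature=0.7,     # moderate randomness; tune per task
    top_p=0.9,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

For instruction-following behavior (e.g. assistants for public administration), the base model would typically first be fine-tuned; the PLLuM organic-instruction and preference corpora described above were collected for exactly that purpose.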