PLLuM: A Family of Polish Large Language Models
PLLuM-12B-nc-instruct is part of the PLLuM family, a suite of large language models developed by a consortium of Polish scientific institutions, with Politechnika Wrocławska as the project leader. This 12-billion-parameter instruction-tuned model builds on the Mistral-Nemo-Base-2407 architecture and is optimized for Polish and other Slavic and Baltic languages, with English data included for broader generalization. It supports a 32,768-token context window.
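For orientation, below is a minimal sketch of loading the model and generating a Polish response with the Hugging Face transformers library. The repository ID `CYFRAGOVPL/PLLuM-12B-nc-instruct` and the presence of a built-in chat template are assumptions; check the official model card for the exact identifiers.

```python
# Minimal sketch, assuming the model is hosted on the Hugging Face Hub
# under the ID below and ships a chat template (typical for instruct models).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "CYFRAGOVPL/PLLuM-12B-nc-instruct"  # assumed repo ID

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # ~24 GB of weights for 12B parameters in bf16
    device_map="auto",
)

messages = [
    # "Summarize in two sentences: what is PLLuM?"
    {"role": "user", "content": "Streść w dwóch zdaniach: czym jest PLLuM?"}
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=200, do_sample=False)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```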
Key Capabilities
- Deep Polish Language Specialization: Trained on up to 150 billion tokens of high-quality Polish text, along with additional Slavic, Baltic, and English data.
- Organic Instruction Tuning: Refined using a unique, manually curated dataset of approximately 40,000 Polish prompt-response pairs, including multi-turn dialogues, designed to capture subtle aspects of human-model interaction.
- Polish Preference Learning: Utilizes the first Polish-language preference corpus for alignment, enhancing correctness, balance, and safety, particularly for sensitive topics.
- Strong Performance: Achieves state-of-the-art results in general Polish-language tasks and top scores on custom benchmarks relevant to Polish public administration.
- Retrieval-Augmented Generation (RAG) Support: Designed to perform well in RAG settings, making it suitable for complex information retrieval and question answering (a prompt-assembly sketch follows this list).
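To make the RAG use case concrete, here is a sketch of packing retrieved passages into a single user turn. The helper name, the Polish instruction wording, and the sample passages are hypothetical illustrations rather than an official PLLuM prompt format, and the retrieval step itself (e.g. a vector-store lookup) is stubbed out.

```python
# Hedged sketch: build_rag_messages and the passages below are hypothetical
# stand-ins; in practice the passages come from a retriever over your corpus.
def build_rag_messages(question: str, passages: list[str]) -> list[dict]:
    """Pack retrieved passages and the question into one chat message."""
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    instruction = (
        # "Answer the question based only on the context below.
        #  If the context does not contain the answer, say so plainly."
        "Odpowiedz na pytanie wyłącznie na podstawie poniższego kontekstu. "
        "Jeśli kontekst nie zawiera odpowiedzi, powiedz to wprost.\n\n"
        f"Kontekst:\n{context}\n\nPytanie: {question}"
    )
    return [{"role": "user", "content": instruction}]

# Example passages from a public-administration corpus (hypothetical):
# "An ID card application can be submitted at any municipal office." /
# "Issuing an ID card is free of charge."
passages = [
    "Wniosek o dowód osobisty można złożyć w dowolnym urzędzie gminy.",
    "Wydanie dowodu osobistego jest bezpłatne.",
]
# "How much does it cost to get an ID card?"
messages = build_rag_messages("Ile kosztuje wyrobienie dowodu osobistego?", passages)
# `messages` can then be passed to tokenizer.apply_chat_template
# exactly as in the loading example above.
```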
Good for
- General Polish Language Tasks: Text generation, summarization, and question answering in Polish.
- Domain-Specific Assistants: Particularly effective for applications in Polish public administration, legal, and bureaucratic contexts requiring domain-aware retrieval.
- Research & Development: Serving as a robust foundation for building downstream AI applications that demand strong command of the Polish language.
This specific model, PLLuM-12B-nc-instruct, is intended for non-commercial use under the CC-BY-NC-4.0 license.