PLLuM: A Family of Polish Large Language Models
CYFRAGOVPL/Llama-PLLuM-8B-chat is an 8-billion-parameter model from the PLLuM family, built on the Llama 3.1 architecture. Developed by a consortium of leading Polish scientific institutions, PLLuM models specialize in Polish and other Slavic and Baltic languages, with additional English data for broader generalization. This model is the chat-tuned variant, aligned with human preferences for safer and more effective dialogue.
Key Capabilities
- Extensive Polish Data: Pretrained on up to 150 billion tokens of Polish text, along with Slavic, Baltic, and English data.
- Organic Instruction Tuning: Refined using a unique, manually curated dataset of approximately 40,000 Polish "organic instructions," including multi-turn dialogues, designed to mitigate negative linguistic transfer.
- Polish Preference Corpus: Features the first Polish-language preference corpus, manually assessed by diverse annotators to teach correctness, balance, and safety.
- State-of-the-Art Polish Performance: Achieves top scores on custom benchmarks relevant to Polish public administration and state-of-the-art results in broader Polish-language tasks.
- Retrieval Augmented Generation (RAG) Support: Trained to perform well in RAG settings, with a specific .jinja prompt format for document-based question answering and citation.
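To make the RAG capability concrete, here is a minimal sketch of a document-grounded prompt with numbered, citable sources. Note this is only illustrative: the model ships its own .jinja prompt template, and the `build_rag_prompt` function, its document schema (`title`/`text` keys), and the instruction wording below are assumptions of this sketch, not the official PLLuM format.

```python
# Illustrative sketch only: PLLuM defines its actual RAG format in a
# .jinja template shipped with the model; this function, the document
# schema, and the instruction text are assumptions for demonstration.

def build_rag_prompt(question: str, documents: list[dict]) -> str:
    """Format retrieved documents and a question into one prompt string.

    Each document is a dict with 'title' and 'text' keys (an assumed
    schema, not the official one). Documents are numbered so the model
    can cite them by bracketed index.
    """
    doc_lines = []
    for i, doc in enumerate(documents, start=1):
        doc_lines.append(f"[{i}] {doc['title']}\n{doc['text']}")
    context = "\n\n".join(doc_lines)
    return (
        "Answer the question using only the documents below. "
        "Cite sources by their bracketed numbers.\n\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )

docs = [
    {"title": "Ustawa o samorządzie gminnym",
     "text": "Gmina wykonuje zadania publiczne w imieniu własnym..."},
    {"title": "Kodeks postępowania administracyjnego",
     "text": "Organy administracji publicznej załatwiają sprawy bez zbędnej zwłoki..."},
]
prompt = build_rag_prompt("Jakie zadania wykonuje gmina?", docs)
print(prompt)
```

For production use, the prompt string should instead be produced by the .jinja template distributed with the model, so that formatting matches what the model saw during training.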
Good For
- General Language Tasks: Text generation, summarization, and question answering in Polish.
- Domain-Specific Assistants: Particularly effective for applications in Polish public administration, legal, or bureaucratic contexts requiring domain-aware retrieval.
- Research & Development: Serving as a foundational model for AI applications where strong command of the Polish language is essential.