Overview of Llama-PLLuM-70B-base-2508

Llama-PLLuM-70B-base-2508 is a 70-billion-parameter large language model in the PLLuM family, developed by the HIVE AI consortium. It is built on the Llama 3.1 architecture and designed for high performance in Polish and other Slavic and Baltic languages, with English data included for broader generalization. Its development involved extensive collection of high-quality, legally compliant Polish text (around 150 billion tokens after cleaning and deduplication).
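
As a rough guide to what 70 billion parameters means for hardware, the sketch below estimates the memory needed just to hold the weights at a few common precisions. It is back-of-the-envelope arithmetic only: the parameter count is rounded to 70e9, and activations, the KV cache, and framework overhead are not included, so real requirements will be higher.

```python
# Weight-only memory estimate for a model with roughly 70 billion parameters.
# Assumption: exactly 70e9 parameters; the true count may differ slightly.
PARAMS = 70e9

# Bytes used to store one parameter at each precision.
BYTES_PER_PARAM = {"float32": 4.0, "bfloat16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(params: float, dtype: str) -> float:
    """Return weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return params * BYTES_PER_PARAM[dtype] / 1e9


for dtype in BYTES_PER_PARAM:
    print(f"{dtype:>8}: ~{weight_memory_gb(PARAMS, dtype):.0f} GB")
```

In bfloat16 the weights alone come to about 140 GB, which is why a model of this size is typically sharded across several accelerators or quantized.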

Key Capabilities

Specialized Multilingualism: Optimized for Polish and other Slavic and Baltic languages, with English data included in training to support broader generalization.
High-Quality Training Data: Continually pretrained on approximately 18 billion tokens, about 17 billion of them in Polish, selected for legal compliance and data quality.
Organic Instruction Dataset: Benefits from a unique, manually curated Polish instruction set (~55k prompt-response pairs) designed to mitigate negative linguistic transfer and to cover subtle aspects of supervised fine-tuning (SFT).
Polish Preference Corpus: Utilizes the first Polish-language preference corpus for alignment, enhancing correctness, balance, and safety, especially for sensitive topics.
Strong Evaluation: Achieves top scores on custom benchmarks relevant to Polish public administration and state-of-the-art results in broader Polish-language tasks.

Good For

General Language Tasks: Text generation, summarization, and question answering in Polish.
Domain-Specific Assistants: Particularly effective for applications within Polish public administration, legal, and bureaucratic contexts requiring domain-aware retrieval.
Research & Development: Serving as a foundational model for building downstream AI applications that require a strong command of the Polish language.
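
For the general language tasks listed above, a base model is usually prompted by continuation rather than by chat-style instructions. The sketch below shows one way to do that with the Hugging Face transformers library. The repository ID, the Polish "Pytanie/Odpowiedź" prompt layout, and the helper names `build_few_shot_prompt` and `generate` are illustrative assumptions, not taken from the model card; check the full README for the actual Hub path and recommended settings.

```python
# Assumed Hub repository ID; verify it against the full model card before use.
MODEL_ID = "CYFRAGOVPL/Llama-PLLuM-70B-base-2508"


def build_few_shot_prompt(examples, question):
    """Build a continuation-style prompt from (question, answer) pairs.

    A base model has no chat template, so the task is demonstrated
    through worked examples and the model completes the last answer.
    """
    parts = [f"Pytanie: {q}\nOdpowiedź: {a}" for q, a in examples]
    parts.append(f"Pytanie: {question}\nOdpowiedź:")
    return "\n\n".join(parts)


def generate(prompt, max_new_tokens=64):
    """Load the model and greedily decode a continuation (hypothetical helper).

    Requires torch, transformers, and accelerate, plus enough accelerator
    memory to hold a 70B-parameter model in bfloat16.
    """
    # Imported lazily so the prompt helper works without these packages.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        torch_dtype=torch.bfloat16,
        device_map="auto",  # shard the weights across available devices
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    # Drop the prompt tokens and decode only the newly generated ones.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

A call such as `generate(build_few_shot_prompt([("Jaka jest stolica Polski?", "Warszawa")], "Jaka jest stolica Litwy?"))` would then return the model's continuation of the final answer line.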
