Overview
OPI-PG/Qra-1b: A Polish-Optimized Foundation Model
OPI-PG/Qra-1b is a 1.1-billion-parameter large language model developed collaboratively by the National Information Processing Institute (OPI) and Gdańsk University of Technology (PG). It belongs to the Qra series of models adapted specifically for the Polish language.
Key Characteristics & Training:
- Base Model: Initialized from TinyLlama-1.1B checkpoints.
- Polish Data Training: Further trained on a meticulously cleaned, filtered, and deduplicated corpus of approximately 90 billion Polish tokens, primarily sourced from web data like CommonCrawl and MADLAD-400.
- Data Preprocessing: Utilized a robust pipeline including text normalization, removal of short documents, heuristic sentence cleaning, quality classification, perplexity-based filtering, topical domain assignment, and fuzzy deduplication.
- Technical Optimizations: Trained with modern techniques such as `torch.compile`, the `adamw_apex_fused` optimizer, Flash Attention 2, mixed precision, gradient accumulation, and FSDP.
- Context Length: Supports a context length of 4096 tokens.
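The exact preprocessing pipeline is not reproduced here, but the kind of steps listed above (normalization, short-document removal, fuzzy deduplication) can be illustrated with a toy standard-library sketch. The thresholds, shingle size, and the pairwise Jaccard comparison are illustrative simplifications; at a 90-billion-token scale, fuzzy deduplication is done with approximate methods such as MinHash/LSH rather than all-pairs comparison:

```python
import re

def normalize(text):
    # Basic text normalization: collapse runs of whitespace.
    return re.sub(r"\s+", " ", text).strip()

def shingles(text, n=3):
    # Character n-gram shingles used for fuzzy similarity.
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def jaccard(a, b):
    # Jaccard similarity of two shingle sets.
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def clean_corpus(docs, min_len=20, dup_threshold=0.8):
    """Normalize, drop short documents, and fuzzy-deduplicate.

    min_len and dup_threshold are illustrative values, not the
    settings used for the actual Qra corpus.
    """
    kept, kept_shingles = [], []
    for doc in docs:
        doc = normalize(doc)
        if len(doc) < min_len:
            continue  # remove short documents
        sh = shingles(doc)
        if any(jaccard(sh, prev) >= dup_threshold for prev in kept_shingles):
            continue  # near-duplicate of an already-kept document
        kept.append(doc)
        kept_shingles.append(sh)
    return kept
```

Quality classification, perplexity-based filtering, and topical domain assignment would plug in as additional per-document filters in the same loop.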
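The optimizations listed above map onto standard Hugging Face Transformers training options. The following is a hedged sketch of how such a configuration might look; the output path, batch sizes, and accumulation steps are placeholders, not the actual Qra training recipe:

```python
from transformers import TrainingArguments

# Illustrative settings only; the real Qra hyperparameters are not shown here.
args = TrainingArguments(
    output_dir="qra-1b-continued-pretraining",  # placeholder path
    bf16=True,                          # mixed precision
    torch_compile=True,                 # torch.compile
    optim="adamw_apex_fused",           # fused AdamW (requires NVIDIA Apex)
    per_device_train_batch_size=4,      # illustrative value
    gradient_accumulation_steps=16,     # illustrative value
    fsdp="full_shard",                  # Fully Sharded Data Parallel
)

# Flash Attention 2 is selected when loading the model, e.g.:
# AutoModelForCausalLM.from_pretrained(..., attn_implementation="flash_attention_2")
```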
Performance:
- PolEval-2018: Achieved a perplexity of 14.7 on the PolEval-2018 test set, outperforming many other Polish and English models in its size class.
- Long Documents (2024): Demonstrated a perplexity of 6.1 on a new dataset of long Polish documents from 2024, indicating strong performance on contemporary and extended texts.
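The perplexity figures above are exponentiated mean per-token negative log-likelihoods, so lower is better. A minimal sketch of the computation, using invented per-token NLL values for illustration:

```python
import math

def perplexity(token_nlls):
    """Perplexity = exp of the mean per-token negative log-likelihood (in nats)."""
    return math.exp(sum(token_nlls) / len(token_nlls))

# Hypothetical per-token NLLs for a short evaluation text.
nlls = [2.1, 3.0, 2.5, 2.8]
print(round(perplexity(nlls), 2))  # → 13.46
```

A perplexity of 14.7 thus corresponds to the model being, on average, about as uncertain as a uniform choice among ~15 tokens at each step.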
Important Note:
Qra-1b is a foundation language model trained with a causal language modeling objective. It is not intended for conversational or instruction-following use out of the box and requires further fine-tuning for such applications.
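The causal language modeling objective means the model is trained only to predict each token from the tokens before it, which is why instruction following does not emerge without further fine-tuning. A tiny sketch of the (context, target) pairs this objective produces, using an invented token sequence:

```python
def causal_lm_pairs(tokens):
    """Causal LM objective: predict each token from all preceding tokens."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

print(causal_lm_pairs(["Ala", "ma", "kota"]))
# → [(['Ala'], 'ma'), (['Ala', 'ma'], 'kota')]
```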