OPI-PG/Qra-7b: A Foundation Model for Polish Language Processing
OPI-PG/Qra-7b is a 7-billion-parameter large language model developed in a collaboration between the National Information Processing Institute (OPI) and Gdańsk University of Technology (PG). It is adapted from meta-llama/Llama-2-7b-hf and was further pretrained on a cleaned, filtered, and deduplicated corpus of approximately 90 billion Polish tokens, sourced primarily from web data such as CommonCrawl and MADLAD-400.
Key Characteristics & Training
- Polish Language Focus: Specifically designed and trained for the Polish language, making it highly proficient in Polish text generation and understanding.
- Robust Preprocessing: The training data underwent rigorous preprocessing, including text normalization, URL removal, document filtering based on length and quality classifiers, language identification, and fuzzy deduplication within 18 topical domains.
- Technical Optimizations: Trained for one epoch on 4096-token sequences, utilizing optimizations such as torch.compile, the adamw_apex_fused optimizer, Flash Attention 2, mixed precision, gradient accumulation, and FSDP (Fully Sharded Data Parallel).
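As a rough illustration of the preprocessing steps listed above, the sketch below chains URL removal, whitespace normalization, a length filter, and fuzzy deduplication. The character-shingle Jaccard similarity, the thresholds, and all function names are illustrative assumptions, not the actual pipeline used to build the Qra corpus.

```python
import re

URL_RE = re.compile(r"https?://\S+")

def normalize(text: str) -> str:
    """Strip URLs and collapse whitespace (a simplified stand-in
    for the pipeline's text-normalization step)."""
    return " ".join(URL_RE.sub("", text).split())

def shingles(text: str, n: int = 3) -> set:
    """Character n-gram shingles used for fuzzy matching."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

def fuzzy_dedup(docs, min_len=20, threshold=0.8):
    """Keep documents that pass a length filter and are not
    near-duplicates of an already kept document.
    min_len and threshold are assumed values for illustration."""
    kept, seen_shingles = [], []
    for doc in docs:
        doc = normalize(doc)
        if len(doc) < min_len:
            continue  # drop documents that are too short
        sh = shingles(doc)
        if any(jaccard(sh, seen) >= threshold for seen in seen_shingles):
            continue  # drop near-duplicates
        kept.append(doc)
        seen_shingles.append(sh)
    return kept
```

In the real pipeline, deduplication was applied within each of the 18 topical domains rather than globally, which keeps the pairwise comparisons tractable.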
Performance & Evaluation
Qra-7b demonstrates strong performance in perplexity benchmarks on Polish texts:
- PolEval-2018: Achieved a perplexity of 11.3, significantly outperforming other Polish models such as szymonrucinski/Curie-7B-v1 (13.5) and English models such as meta-llama/Llama-2-7b-hf (24.3).
- Long Documents (2024): Showed a perplexity of 4.5 on a new dataset of long Polish documents (news and scientific articles from 2024), surpassing szymonrucinski/Curie-7B-v1 (4.8) and meta-llama/Llama-2-7b-hf (5.9).
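To make the figures above concrete: perplexity is the exponential of the mean per-token negative log-likelihood, so lower values mean the model finds Polish text less "surprising". The helper below is a generic sketch of that definition, not the evaluation script used for these benchmarks.

```python
import math

def perplexity(token_nlls):
    """Corpus perplexity: exp of the mean per-token negative
    log-likelihood (natural log), i.e. the geometric mean of 1/p."""
    return math.exp(sum(token_nlls) / len(token_nlls))

# Toy example: a model assigns these probabilities to three tokens.
probs = [0.25, 0.5, 0.125]
nlls = [-math.log(p) for p in probs]
print(perplexity(nlls))  # geometric mean of 1/p: (4 * 2 * 8) ** (1/3) = 4.0
```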
Important Note
Qra models are foundation language models trained with a causal language modeling objective. They are not intended for conversational or instruction-following purposes out-of-the-box and require further fine-tuning for such applications.
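The causal language modeling objective mentioned above is next-token prediction: at each position, the model is penalized by the cross-entropy of the true next token given everything before it. A minimal sketch of the loss over toy logits (plain Python, not Qra's training code):

```python
import math

def causal_lm_loss(logits, token_ids):
    """Causal LM objective: average cross-entropy of predicting
    token t+1 from positions <= t. logits[t] are unnormalized
    scores over the vocabulary after reading tokens [0..t]."""
    total = 0.0
    for t in range(len(token_ids) - 1):
        scores = logits[t]
        # log of the softmax normalizer over the vocabulary
        log_z = math.log(sum(math.exp(s) for s in scores))
        # -log p(next token) under the softmax distribution
        total += log_z - scores[token_ids[t + 1]]
    return total / (len(token_ids) - 1)
```

A model trained only with this objective continues text; turning it into an assistant requires instruction tuning on top.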