OPI-PG/Qra-1b

Parameters: 1.1B · Precision: BF16 · Released: Feb 26, 2024 · License: apache-2.0 · Hosted on Hugging Face

Overview

OPI-PG/Qra-1b: A Polish-Optimized Foundation Model

OPI-PG/Qra-1b is a 1.1 billion parameter large language model developed collaboratively by the National Information Processing Institute (OPI) and Gdańsk University of Technology (PG). It is part of the Qra series, specifically adapted for the Polish language.

Key Characteristics & Training:

  • Base Model: Initialized from TinyLlama-1.1B checkpoints.
  • Polish Data Training: Further trained on a meticulously cleaned, filtered, and deduplicated corpus of approximately 90 billion Polish tokens, primarily sourced from web data like CommonCrawl and MADLAD-400.
  • Data Preprocessing: Utilized a robust pipeline including text normalization, removal of short documents, heuristic sentence cleaning, quality classification, perplexity-based filtering, topical domain assignment, and fuzzy deduplication.
  • Technical Optimizations: Trained with modern techniques such as torch.compile, adamw_apex_fused optimizer, Flash Attention 2, mixed precision, gradient accumulation, and FSDP.
  • Context Length: Supports a context length of 4096 tokens.
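The fuzzy-deduplication step in the pipeline above can be sketched in a few lines of pure Python. This is only a minimal illustration using character-shingle Jaccard similarity; the card does not specify the actual Qra pipeline (e.g. MinHash signatures, shingle size, or similarity threshold), so the helper names, `k = 5`, and the `0.8` threshold here are assumptions.

```python
def shingles(text: str, k: int = 5) -> set[str]:
    """Break text into overlapping character k-grams after whitespace normalization."""
    text = " ".join(text.lower().split())
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}

def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity of two shingle sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def fuzzy_dedup(docs: list[str], threshold: float = 0.8) -> list[str]:
    """Keep each document only if it is not a near-duplicate of any already-kept one."""
    kept: list[tuple[str, set[str]]] = []
    for doc in docs:
        s = shingles(doc)
        if all(jaccard(s, kept_shingles) < threshold for _, kept_shingles in kept):
            kept.append((doc, s))
    return [doc for doc, _ in kept]
```

At web scale this pairwise comparison is far too slow, which is why production pipelines typically replace exact Jaccard with MinHash-based locality-sensitive hashing; the filtering logic, however, is the same.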

Performance:

  • PolEval-2018: Achieved a perplexity of 14.7 on the PolEval-2018 test set, outperforming many other Polish and English models in its size class.
  • Long Documents (2024): Demonstrated a perplexity of 6.1 on a new dataset of long Polish documents from 2024, indicating strong performance on contemporary and extended texts.
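The perplexity figures above are the exponential of the mean per-token negative log-likelihood on the test set. A minimal sketch of the computation (the per-token losses below are made-up numbers for illustration, not values from the Qra evaluation):

```python
import math

def perplexity(token_nlls: list[float]) -> float:
    """Perplexity = exp(average negative log-likelihood per token, in nats)."""
    return math.exp(sum(token_nlls) / len(token_nlls))

# Hypothetical per-token losses; a lower average loss gives a lower perplexity.
losses = [2.9, 2.6, 2.7, 2.6]
print(round(perplexity(losses), 1))  # mean loss 2.7 → perplexity ≈ 14.9
```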

Important Note:

Qra-1b is a foundation language model trained with a causal language modeling objective. It is not intended for conversational or instruction-following tasks out-of-the-box and requires further fine-tuning for such applications.
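The causal language modeling objective mentioned above trains the model to predict each token from the tokens before it. A minimal sketch of how input/target pairs are formed from a tokenized sequence (the token ids are illustrative, not from the Qra tokenizer):

```python
def causal_lm_pairs(token_ids: list[int]) -> tuple[list[int], list[int]]:
    """Inputs are tokens [0..n-2]; targets are the same sequence shifted left by one."""
    return token_ids[:-1], token_ids[1:]

tokens = [101, 7, 42, 9, 102]  # hypothetical token ids
inputs, targets = causal_lm_pairs(tokens)
# At position i, the model sees inputs[:i+1] and is trained to predict targets[i].
```

Because the objective only ever models "next token given prefix", the raw model continues text rather than following instructions, which is why instruction or chat fine-tuning is needed for conversational use.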