HiTZ/gl_Qwen3-8B-Base is an 8 billion parameter Galician (gl) language-specific base language model developed by the HiTZ Research Center. It is built upon the Qwen3-8B-Base architecture and further pretrained on approximately 3.5 billion tokens of curated Galician data, alongside a small English subset. This model is designed as a foundational base for further fine-tuning, instruction tuning, or domain adaptation in Galician.
Model Overview
HiTZ/gl_Qwen3-8B-Base is a specialized 8 billion parameter language model developed by the HiTZ Research Center. It is a Galician (gl) language-specific base model, derived from the Qwen3-8B-Base architecture, and has undergone further pretraining on extensive Galician datasets.
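As a base model, it can be loaded with the standard Hugging Face transformers causal-language-model classes. The snippet below is a minimal sketch assuming a recent transformers release with Qwen3 support and enough GPU memory for an 8B checkpoint; the Galician prompt is purely illustrative.

```python
# Minimal sketch: load the base model and generate a Galician continuation.
# Assumes `transformers`, `torch`, and `accelerate` are installed (bfloat16 here;
# adjust dtype/device settings to your hardware).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HiTZ/gl_Qwen3-8B-Base"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Base model: plain text completion, no chat template.
prompt = "Galicia é unha comunidade autónoma situada no"  # illustrative prompt
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```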
Key Characteristics
- Language Focus: Continued pretraining concentrates on Galician data, giving the model strong proficiency in the language.
- Base Model: Released as a base model, it is intended for subsequent fine-tuning, instruction tuning, or domain adaptation to suit specific applications.
- Training Data: Further pretrained on approximately 3.5 billion Galician tokens sourced from CorpusNÓS (web crawls and public administration texts), plus a 0.3 billion token English subset from FineWeb included to mitigate catastrophic forgetting (a mixing sketch follows this list).
- Training Methodology: Follows the methodology for low-resource languages proposed by Etxaniz et al. (2024), ensuring comparable corpus sizes across languages.
- Context Length: Trained with a maximum sequence length of 8,196 tokens.
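The Galician/English corpus mixture described above can be illustrated with a small sketch using the Hugging Face datasets library. The file names and mixing probabilities below are assumptions derived from the stated token counts (roughly 3.5B Galician vs. 0.3B English), not the actual training configuration.

```python
# Illustrative sketch of mixing a large Galician corpus with a small English
# subset, where the English data acts as a regularizer against forgetting.
# Dataset paths are placeholders; probabilities follow the stated token counts.
from datasets import load_dataset, interleave_datasets

galician = load_dataset("json", data_files="corpusnos_gl.jsonl", split="train", streaming=True)
english = load_dataset("json", data_files="fineweb_en_subset.jsonl", split="train", streaming=True)

total = 3.5 + 0.3
mixed = interleave_datasets(
    [galician, english],
    probabilities=[3.5 / total, 0.3 / total],  # ≈0.92 Galician, ≈0.08 English
    seed=42,
)
```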
Intended Use Cases
This model is particularly well-suited for:
- Galician Language Applications: Developing applications that require strong understanding and generation of Galician text.
- Further Fine-tuning: Serving as a strong foundation for instruction-tuned, domain-specific, or task-specific models in Galician (see the adaptation sketch after this list).
- Research: Exploring language model adaptation and performance in low-resource languages.
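For the fine-tuning use case, one common starting point is parameter-efficient adaptation. The sketch below uses LoRA via the peft library; the hyperparameters and target module names are illustrative assumptions, not values from the model card.

```python
# Sketch of parameter-efficient adaptation with LoRA via `peft`; rank, alpha,
# dropout, and target modules are assumed defaults, not a published recipe.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_id = "HiTZ/gl_Qwen3-8B-Base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# From here, train with your preferred loop or trainer (e.g. transformers.Trainer)
# on Galician instruction or domain data tokenized to the model's context length.
```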