pkupie/Qwen2.5-3B-ug-cpt
pkupie/Qwen2.5-3B-ug-cpt is a 3.1-billion-parameter Qwen2.5-based language model continually pretrained on the Uyghur subset of the MC^2 Corpus. Developed by pkupie, the model is adapted specifically for Uyghur, improving its language modeling capability for this low-resource language. It is intended primarily for research on low-resource language adaptation, in particular as a base model for model merging and logit fusion.
Overview
pkupie/Qwen2.5-3B-ug-cpt is a 3.1-billion-parameter language model produced by continual pretraining (CPT): it starts from Qwen2.5-3B and is further pretrained on the Uyghur portion of the MC^2 Corpus.
This model was developed to improve language modeling for Uyghur, a low-resource language, and to support research in language adaptation. The methodology and training details are described in the paper "Efficient Low-Resource Language Adaptation via Multi-Source Dynamic Logit Fusion" (ACL 2026).
Key Characteristics
- Base Model: Qwen2.5-3B
- Parameter Count: 3.1 billion
- Context Length: 32768 tokens
- Language Focus: Uyghur (ug)
- Training Paradigm: Continual pretraining (CPT)
- Training Data: Uyghur subset of the MC^2 Corpus
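The checkpoint can be loaded like any other causal language model on the Hugging Face Hub. A minimal sketch using the `transformers` library is shown below; the generation settings and the `generate` helper are illustrative assumptions, not settings recommended by the model authors:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "pkupie/Qwen2.5-3B-ug-cpt"

def generate(prompt: str, max_new_tokens: int = 64) -> str:
    # Download (or load from cache) the tokenizer and model weights.
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype="auto", device_map="auto"
    )
    # This is a base (non-chat) checkpoint, so we use plain-text continuation
    # rather than a chat template.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Note that loading the full 3.1B-parameter model in half precision requires several gigabytes of memory.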
Intended Use Cases
This checkpoint is primarily released for research purposes. It is suitable for:
- Further research in low-resource language adaptation.
- Serving as a base model for experiments in model merging.
- Applications involving logit fusion techniques.
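To make the logit-fusion use case concrete, here is a minimal, framework-free sketch of combining next-token logits from two models (e.g. this Uyghur-adapted checkpoint and its base model). The fixed interpolation weight here is an illustrative assumption; the paper's method instead chooses fusion weights dynamically:

```python
import math

def fuse_logits(logits_a, logits_b, weight_a=0.5):
    # Elementwise weighted combination of two models' logits over
    # the same vocabulary. weight_a is a hypothetical fixed weight;
    # dynamic fusion would recompute it per decoding step.
    weight_b = 1.0 - weight_a
    return [weight_a * a + weight_b * b for a, b in zip(logits_a, logits_b)]

def softmax(logits):
    # Numerically stable softmax over the fused logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy example: two models that each strongly prefer a different token.
fused = fuse_logits([2.0, 0.0], [0.0, 2.0], weight_a=0.5)  # -> [1.0, 1.0]
probs = softmax(fused)  # -> [0.5, 0.5]
```

In practice the two logit vectors would come from forward passes of two `transformers` models sharing a tokenizer, with the fused distribution used for sampling or greedy decoding.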