pkupie/Qwen2.5-1.5B-ug-cpt
pkupie/Qwen2.5-1.5B-ug-cpt is a 1.5-billion-parameter Qwen2.5 model continually pretrained on the Uyghur portion of the MC^2 Corpus, with a 32,768-token context length. Developed by pkupie, the model is adapted specifically for improved Uyghur language modeling. It is primarily intended for research into low-resource language adaptation, model merging, and logit fusion techniques.
Overview
This model, pkupie/Qwen2.5-1.5B-ug-cpt, is a specialized checkpoint derived from the Qwen2.5 1.5B base model. It has undergone continual pretraining (CPT) specifically on the Uyghur subset of the MC^2 Corpus.
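As a quick orientation, here is a minimal loading sketch using Hugging Face transformers; the prompt text is an arbitrary Uyghur example, not taken from the model card, and this treats the checkpoint as a plain causal LM rather than a chat model.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "pkupie/Qwen2.5-1.5B-ug-cpt"

# Load the CPT checkpoint like any other Qwen2.5 causal language model.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")

# Plain language-model continuation (this is a base/CPT model, not instruction-tuned).
prompt = "ئۇيغۇر تىلى"  # illustrative Uyghur prompt
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```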
Key Capabilities & Purpose
- Uyghur Language Adaptation: The primary goal of this model is to enhance language modeling capabilities for Uyghur, a low-resource language.
- Research Focus: It is released as a research artifact to support studies in low-resource language adaptation, particularly methodologies such as model merging and logit fusion (a minimal logit-fusion sketch follows this list).
- Base for Further Work: Researchers can utilize this CPT checkpoint as a foundational model for developing new techniques or applications in the domain of multilingual and low-resource NLP.
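To make the logit-fusion use case concrete, the sketch below averages next-token logits from a general-purpose source model and this CPT checkpoint. It is illustrative only: the use of Qwen/Qwen2.5-1.5B as the source model, the equal 0.5/0.5 weighting, and the assumption that both checkpoints share Qwen2.5's tokenizer and vocabulary are assumptions here, not the dynamic fusion method described in the paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative only: a simple two-model logit average, not the paper's dynamic fusion method.
base_id = "Qwen/Qwen2.5-1.5B"          # assumed general-purpose source model
cpt_id = "pkupie/Qwen2.5-1.5B-ug-cpt"  # Uyghur-adapted CPT checkpoint

tokenizer = AutoTokenizer.from_pretrained(cpt_id)
base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype="auto")
cpt = AutoModelForCausalLM.from_pretrained(cpt_id, torch_dtype="auto")
base.eval()
cpt.eval()

def fused_next_token(prompt: str, w_cpt: float = 0.5) -> str:
    """Greedy next-token prediction from a weighted average of the two models' logits."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits_base = base(**inputs).logits[:, -1, :]
        logits_cpt = cpt(**inputs).logits[:, -1, :]
    fused = w_cpt * logits_cpt + (1.0 - w_cpt) * logits_base
    next_id = fused.argmax(dim=-1)
    return tokenizer.decode(next_id)

print(fused_next_token("ئۇيغۇر تىلى"))  # illustrative Uyghur prompt
```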
Training Details
The model's training methodology and insights are detailed in the paper "Efficient Low-Resource Language Adaptation via Multi-Source Dynamic Logit Fusion" (ACL 2026), available on arXiv.
Intended Use Cases
- Academic Research: Ideal for researchers exploring techniques for adapting large language models to low-resource languages.
- Model Merging & Logit Fusion: Serves as a suitable base model for experiments that combine model weights or fuse model outputs (a weight-averaging sketch follows this list).
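As a companion to the logit-fusion example above, here is a minimal weight-merging sketch using linear interpolation of parameters between a general-purpose source model and this CPT checkpoint. The 0.5 coefficient and the choice of Qwen/Qwen2.5-1.5B as the source model are illustrative assumptions; this is one of the simplest merging schemes, not necessarily the procedure used in the paper.

```python
import torch
from transformers import AutoModelForCausalLM

# Illustrative only: simple linear interpolation of parameters between two same-architecture
# checkpoints. alpha is the weight given to the Uyghur CPT model (arbitrary choice here).
base_id = "Qwen/Qwen2.5-1.5B"          # assumed general-purpose source model
cpt_id = "pkupie/Qwen2.5-1.5B-ug-cpt"
alpha = 0.5

base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.float32)
cpt = AutoModelForCausalLM.from_pretrained(cpt_id, torch_dtype=torch.float32)

# Interpolate every parameter; keys match because both models share the Qwen2.5-1.5B architecture.
cpt_state = cpt.state_dict()
merged_state = {
    name: (1.0 - alpha) * param + alpha * cpt_state[name]
    for name, param in base.state_dict().items()
}

base.load_state_dict(merged_state)     # reuse the base architecture to hold the merged weights
base.save_pretrained("qwen2.5-1.5b-ug-merged")
```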