pkupie/gemma-3-4b-ug-cpt
pkupie/gemma-3-4b-ug-cpt is a 4.3-billion-parameter Gemma 3 continual pretraining (CPT) checkpoint developed by pkupie. It is further pretrained on the Uyghur portion of the MC^2 Corpus and supports a 32,768-token context length. The model is designed to strengthen Uyghur language modeling and to support research on low-resource language adaptation.
Overview
This checkpoint was obtained by continually pretraining the base Gemma 3 PT 4B model on the Uyghur subset of the MC^2 Corpus. Its goal is to advance Uyghur language modeling and to facilitate research into adapting large language models to low-resource languages.
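For experimentation, the checkpoint can be loaded with the Hugging Face transformers library. The snippet below is a minimal sketch, assuming the checkpoint exposes the standard Gemma 3 text-generation interface through AutoModelForCausalLM and that bf16 weights fit on the available device; the Uyghur prompt is an arbitrary example, not taken from the training data.

```python
# Minimal loading and generation sketch (assumptions noted above).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "pkupie/gemma-3-4b-ug-cpt"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # assumption: bf16 weights, as is common for Gemma 3
    device_map="auto",           # requires the accelerate package
)

# Arbitrary Uyghur prompt ("Uyghur language" in Arabic script).
prompt = "ئۇيغۇر تىلى"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

Because this is a pretrained (not instruction-tuned) checkpoint, it is best suited to continuation-style prompting rather than chat.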
Key Characteristics
- Base Model: Gemma 3 PT 4B
- Parameter Count: 4.3 billion
- Context Length: 32,768 tokens
- Training Data: Uyghur portion of the MC^2 Corpus
- Training Paradigm: Continual Pretraining (CPT)
- Research Focus: Low-resource language adaptation, specifically for Uyghur
Intended Use
This checkpoint is released primarily for research purposes. It is intended as a foundation for follow-up work, particularly on model merging and logit fusion techniques. The training methodology is detailed in the paper "Efficient Low-Resource Language Adaptation via Multi-Source Dynamic Logit Fusion" (ACL 2026).
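To make the fusion idea concrete, the sketch below interpolates next-token logits from this checkpoint and its base model with a fixed weight. This is an illustration of logit fusion in general, not the dynamic multi-source method from the paper; the base-model Hub ID google/gemma-3-4b-pt and the weight alpha are assumptions made for the example.

```python
# Static two-model logit interpolation: an illustrative sketch only,
# NOT the paper's multi-source dynamic fusion method.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

cpt_id = "pkupie/gemma-3-4b-ug-cpt"   # Uyghur CPT checkpoint
base_id = "google/gemma-3-4b-pt"      # assumed Hub ID of the base Gemma 3 PT 4B

tokenizer = AutoTokenizer.from_pretrained(cpt_id)  # CPT keeps the base tokenizer
cpt = AutoModelForCausalLM.from_pretrained(cpt_id, torch_dtype=torch.bfloat16, device_map="auto")
base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16, device_map="auto")

alpha = 0.7  # hypothetical fixed weight on the Uyghur CPT logits
ids = tokenizer("ئۇيغۇر تىلى", return_tensors="pt").input_ids.to(cpt.device)

with torch.no_grad():
    for _ in range(32):  # greedy decoding over the fused distribution (no KV cache, for clarity)
        cpt_logits = cpt(input_ids=ids).logits[:, -1, :]
        base_logits = base(input_ids=ids.to(base.device)).logits[:, -1, :].to(cpt.device)
        fused = alpha * cpt_logits + (1 - alpha) * base_logits
        ids = torch.cat([ids, fused.argmax(dim=-1, keepdim=True)], dim=-1)

print(tokenizer.decode(ids[0], skip_special_tokens=True))
```

A dynamic variant would adjust the weight per token or per source model; see the paper for the actual method.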
Good For
- Researchers working on Uyghur language processing.
- Experiments involving continual pretraining and adaptation for low-resource languages (a perplexity comparison sketch follows this list).
- Developing and testing model merging or logit fusion strategies.
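One simple way to quantify what the CPT stage contributed is to compare token-level perplexity on held-out Uyghur text against the base model. The sketch below is a minimal version, assuming a user-supplied evaluation string; the placeholder must be replaced, as no evaluation data ships with the model.

```python
# Perplexity sketch for a single checkpoint; rerun with the base model ID to compare.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "pkupie/gemma-3-4b-ug-cpt"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")
model.eval()

uyghur_text = "..."  # placeholder: substitute held-out Uyghur evaluation text
ids = tokenizer(uyghur_text, return_tensors="pt").input_ids.to(model.device)

with torch.no_grad():
    # With labels=input_ids, the model returns the mean next-token cross-entropy.
    loss = model(input_ids=ids, labels=ids).loss

print(f"perplexity: {torch.exp(loss).item():.2f}")
```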