pkupie/gemma-3-4b-kk-cpt
pkupie/gemma-3-4b-kk-cpt is a 4.3-billion-parameter Gemma 3 model continually pretrained on the Kazakh (Arabic script) portion of the MC^2 Corpus. Developed by pkupie, the model has a 32768-token context length and is optimized for Kazakh language modeling and low-resource language adaptation research. It serves as a base model for further research in areas such as model merging and logit fusion.
Overview
This model, pkupie/gemma-3-4b-kk-cpt, is a 4.3-billion-parameter Gemma 3 base model that has undergone continual pretraining (CPT). It is specialized for Kazakh written in the Arabic script, using the corresponding subset of the MC^2 Corpus as its training data.
Key Capabilities
- Enhanced Kazakh Language Modeling: Improves language-modeling performance on Kazakh text in the Arabic script relative to the original Gemma 3 checkpoint.
- Low-Resource Language Adaptation: Designed to support research and development in adapting large language models to languages with limited data.
- Research Base Model: Intended as a foundational checkpoint for further academic exploration, particularly in advanced techniques like model merging and logit fusion.
Training Details
The model's training methodology is detailed in the paper "Efficient Low-Resource Language Adaptation via Multi-Source Dynamic Logit Fusion" (ACL 2026). Training follows a continual-pretraining (CPT) recipe starting from the original Gemma 3 PT 4B checkpoint.
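To give an intuition for the logit-fusion direction this checkpoint is meant to support, here is a toy sketch of static logit fusion: next-token logits from several source models are combined as a weighted sum before the softmax. This is an illustrative simplification written for this card, not the paper's dynamic weighting algorithm; the models, vocabulary, and weights below are hypothetical.

```python
import numpy as np

def fuse_logits(logits_list, weights):
    """Weighted fusion of per-model next-token logits.

    Toy illustration with fixed weights; the paper's method computes
    the weights dynamically per step (not shown here).
    """
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()      # normalize fusion weights
    stacked = np.stack(logits_list)        # shape: (n_models, vocab_size)
    return (weights[:, None] * stacked).sum(axis=0)

def softmax(x):
    z = x - x.max()                        # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Two hypothetical source models over a 4-token vocabulary
model_a = np.array([2.0, 0.5, 0.1, -1.0])
model_b = np.array([0.0, 1.5, 0.2, 0.3])

fused = fuse_logits([model_a, model_b], weights=[0.7, 0.3])
probs = softmax(fused)                     # fused next-token distribution
```

The fused distribution can then be sampled or decoded exactly like the output of a single model, which is what makes this family of techniques attractive for adapting a base checkpoint like this one without further training.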
Intended Use Cases
- Academic Research: Ideal for researchers studying low-resource language processing, model adaptation, and multilingual NLP.
- Base for Fine-tuning: Can serve as a strong starting point for fine-tuning on specific Kazakh language tasks.
- Experimentation: Suitable for exploring novel approaches in model merging and logit fusion within a low-resource context.
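For the fine-tuning and experimentation use cases above, the checkpoint should load like any other causal Gemma 3 model via Hugging Face transformers. A minimal sketch, assuming a transformers version with Gemma 3 support (>= 4.50) and enough memory for a 4B model; the prompt is a placeholder and, since this is a base (non-instruction-tuned) model, it expects plain text-completion prompts:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "pkupie/gemma-3-4b-kk-cpt"

def generate(prompt: str, max_new_tokens: int = 64) -> str:
    """Greedy-decode a continuation from the CPT base model."""
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    return tokenizer.decode(output[0], skip_special_tokens=True)

if __name__ == "__main__":
    # Replace with a Kazakh (Arabic script) prompt of your choice
    print(generate("..."))
```

For downstream fine-tuning, the same `AutoModelForCausalLM.from_pretrained(MODEL_ID)` call is the usual starting point for standard trainer or PEFT workflows.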