Model Overview
This model, canbingol/gemma3_1B_base-tr-cpt-1epoch_stage3, is a 1-billion-parameter Gemma-based language model that has undergone Stage 3 of Turkish Continued Pretraining (CPT). It was initialized from canbingol/gemma3_1B_base-tr-cpt-1epoch_stage2, making it a direct continuation of prior Turkish language-adaptation work.
Key Characteristics
- Turkish Language Focus: Specifically adapted for the Turkish language through continued pretraining on a Turkish web corpus.
- Sequential CPT: This is the third stage in a sequential pretraining process, building on data from previous stages. It was trained on samples 100,000 to 150,000 of the canbingol/vngrs-web-corpus-200k dataset.
- Cumulative Data Exposure: Across its three pretraining stages, the model has been exposed to approximately 150,000 samples from the Turkish corpus.
- Gemma Architecture: Based on the Gemma-3-1B architecture, providing a compact yet capable foundation.
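The Stage 3 data slice described above (samples 100,000 to 150,000 of the corpus) can be reproduced with the Hugging Face `datasets` split-slicing syntax. This is a hedged sketch: the exact split name (`train`) and loading code used for the actual pretraining run are assumptions, not confirmed by the model card.

```python
# Sketch: selecting the Stage 3 slice of the Turkish web corpus.
# Assumes the corpus has a single "train" split; adjust if it differs.
STAGE3_START, STAGE3_END = 100_000, 150_000
split_spec = f"train[{STAGE3_START}:{STAGE3_END}]"  # "train[100000:150000]"

def load_stage3_slice():
    # Requires the `datasets` package and network access to the Hub.
    from datasets import load_dataset
    return load_dataset("canbingol/vngrs-web-corpus-200k", split=split_spec)
```

The bracketed split expression is the standard `datasets` idiom for taking a contiguous sample range without downloading-then-filtering manually.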
Good For
- Turkish Text Generation: Ideal for tasks requiring the generation of coherent and contextually relevant Turkish text.
- Turkish NLP Applications: Suitable for various natural language processing tasks in Turkish, benefiting from its domain-adapted training.
- Further Fine-tuning: Can serve as a strong base model for further task-specific fine-tuning in Turkish.
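For the text-generation use cases above, the model can be loaded like any other causal LM on the Hub. The snippet below is a minimal sketch using the standard `transformers` auto classes; generation parameters such as `max_new_tokens` are illustrative defaults, not settings recommended by the model authors.

```python
MODEL_ID = "canbingol/gemma3_1B_base-tr-cpt-1epoch_stage3"

def generate_turkish(prompt: str, max_new_tokens: int = 64) -> str:
    # Requires the `transformers` and `torch` packages plus network
    # access to download the ~1B-parameter checkpoint.
    from transformers import AutoModelForCausalLM, AutoTokenizer
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Because this is a base (pretrained) model rather than an instruction-tuned one, prompts should be phrased as text to continue, and task-specific behavior is best obtained through further fine-tuning.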