canbingol/gemma3_1B_base-tr-cpt-2nd_epoch_stage1 is a 1 billion parameter Gemma-3-1B model, specifically a second-epoch continued pretraining (CPT) variant optimized for Turkish. It builds on a fully trained first-epoch checkpoint, undergoing further domain adaptation by re-exposing the model to the initial subset of the Turkish web corpus. It is designed for stronger Turkish language generation and understanding, with cumulative pretraining exposure of approximately 107.7 million tokens after this stage.
## Overview
This model, canbingol/gemma3_1B_base-tr-cpt-2nd_epoch_stage1, is a 1 billion parameter Gemma-3-1B variant that has undergone second-epoch continued pretraining (CPT) specifically for the Turkish language. It is initialized from the checkpoint of the completed first epoch (canbingol/gemma3_1B_base-tr-cpt-1epoch_stage4), indicating a refinement and further adaptation process rather than initial training.
## Key Characteristics
- Architecture: Based on the Gemma-3-1B model.
- Language Focus: Optimized for Turkish through continued pretraining.
- Training Data: Trained on samples 0-50,000 of the canbingol/vngrs-web-corpus-200k Turkish web corpus during this stage.
- Training Method: Sequential CPT across disjoint data shards, with this stage representing the beginning of the second epoch's pass over the initial data subset.
- Cumulative Exposure: Approximately 107.7 million tokens after this stage, building on 86.1 million tokens from the first epoch.
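The cumulative-exposure figures above imply the token budget of this stage. A quick back-of-the-envelope check (the per-sample average is an illustrative derivation from the card's numbers, not an official statistic):

```python
# Token accounting for this CPT stage, using the figures from the model card.
first_epoch_tokens = 86.1e6    # cumulative tokens after the full first epoch
cumulative_tokens = 107.7e6    # cumulative tokens after this stage
samples_this_stage = 50_000    # samples 0-50,000 of the corpus

# Tokens consumed by this stage alone, and the implied average sample length.
stage_tokens = cumulative_tokens - first_epoch_tokens
avg_tokens_per_sample = stage_tokens / samples_this_stage

print(f"{stage_tokens / 1e6:.1f}M tokens this stage, "
      f"~{avg_tokens_per_sample:.0f} tokens/sample")
# → 21.6M tokens this stage, ~432 tokens/sample
```

So this stage re-exposed the model to roughly 21.6M tokens, consistent with one pass over the first 50,000 corpus samples.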
## Intended Use Cases
- Turkish Language Generation: Ideal for tasks requiring text generation in Turkish.
- Turkish NLP Applications: Suitable for various natural language processing tasks where strong Turkish language understanding is beneficial.
- Further Adaptation: Serves as a strong base for additional fine-tuning on specific Turkish datasets or tasks.
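For generation or as a starting point for fine-tuning, the checkpoint should load through the standard transformers causal-LM API. A minimal sketch (assumes `transformers` and `torch` are installed and that the repository is publicly accessible; this is a base model, so it completes raw text rather than following instructions):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "canbingol/gemma3_1B_base-tr-cpt-2nd_epoch_stage1"

def generate_turkish(prompt: str, max_new_tokens: int = 64) -> str:
    """Greedily continue a Turkish prompt with the CPT checkpoint."""
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(
        **inputs, max_new_tokens=max_new_tokens, do_sample=False
    )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

# Example usage (downloads the model weights on first call):
# print(generate_turkish("Türkiye'nin en büyük şehri"))
```

Because this is a base checkpoint, prompts work best as text to be continued; instruction-style prompting would require further supervised fine-tuning.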