canbingol/gemma3_1B_base-tr-cpt-2nd_epoch_stage2 is a 1B-parameter Gemma-3-1B model by canbingol: the second-epoch, Stage 2 checkpoint of a Turkish Continued Pretraining (CPT) pipeline. It is initialized from the preceding Turkish CPT checkpoint and trained for one epoch on samples 50,000 to 100,000 of a Turkish web corpus. The goal is domain adaptation to Turkish through sequential continued pretraining.
Overview
This model, canbingol/gemma3_1B_base-tr-cpt-2nd_epoch_stage2, is a Gemma-3-1B (1B-parameter) variant produced by Turkish Continued Pretraining (CPT). It is the second-epoch, Stage 2 checkpoint of a sequential training process, building on the canbingol/gemma3_1B_base-tr-cpt-2nd_epoch_stage1 checkpoint.
Key Characteristics
- Architecture: Gemma-3-1B base model.
- Training Objective: Continued Pretraining (CPT) for domain adaptation to Turkish.
- Initialization: Started from a prior Turkish CPT checkpoint, not the original google/gemma-3-1b-pt.
- Dataset: Trained on samples 50,000 to 100,000 of the canbingol/vngrs-web-corpus-200k Turkish web corpus.
- Epochs: Trained for 1 epoch on this specific data shard.
- Token Exposure: This stage added approximately 21.5 million tokens, bringing the cumulative exposure to around 129.2 million tokens across all CPT stages.
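A quick sanity check of the figures above (assuming the 50,000–100,000 shard contains exactly 50,000 samples, which the card implies but does not state outright):

```python
# Back-of-the-envelope check of the token counts reported for this stage.
SAMPLES = 100_000 - 50_000       # samples 50,000..100,000 of the corpus
STAGE_TOKENS = 21_500_000        # ~21.5M tokens added in this stage
CUMULATIVE_TOKENS = 129_200_000  # ~129.2M tokens across all CPT stages

# Average document length implied by the stage's token count.
avg_tokens_per_sample = STAGE_TOKENS / SAMPLES

# How much of the total CPT exposure this single stage contributes.
stage_share = STAGE_TOKENS / CUMULATIVE_TOKENS

print(f"~{avg_tokens_per_sample:.0f} tokens per sample")
print(f"this stage is ~{stage_share:.1%} of cumulative exposure")
```

The implied average of roughly 430 tokens per sample is typical for web-corpus documents, which is consistent with the stated dataset.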
Use Cases
This model is particularly suited for applications requiring a language model with enhanced understanding and generation capabilities in Turkish, benefiting from its specialized continued pretraining on a Turkish web corpus. It is part of a multi-stage, multi-epoch CPT process designed to progressively adapt the base Gemma model to the Turkish language.
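As a base (non-instruct) CPT checkpoint, the model is used for plain text continuation rather than chat. A minimal loading sketch with the `transformers` Auto classes (the repo id is taken from this card; the prompt and generation settings are illustrative, not prescribed by the card):

```python
MODEL_ID = "canbingol/gemma3_1B_base-tr-cpt-2nd_epoch_stage2"

def load_model(model_id: str = MODEL_ID):
    """Fetch the checkpoint from the Hugging Face Hub; returns (tokenizer, model)."""
    # Imported inside the function so the sketch can be read and the constants
    # used without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)
    return tokenizer, model

def complete(prompt: str, max_new_tokens: int = 40) -> str:
    """Continue a Turkish prompt with greedy decoding (illustrative settings)."""
    tokenizer, model = load_model()
    inputs = tokenizer(prompt, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output[0], skip_special_tokens=True)

# Example usage (downloads the model on first call):
# print(complete("İstanbul, Türkiye'nin"))
```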