Overview
This model, canbingol/gemma3_1B_base-tr-cpt-1epoch_stage4, is a 1-billion-parameter Gemma-3-1B variant that has completed Stage 4 of a Turkish Continued Pretraining (CPT) pipeline. It was initialized from canbingol/gemma3_1B_base-tr-cpt-1epoch_stage3, making it a direct continuation of the previous stage's training.
Key Characteristics
- Turkish Language Focus: Specifically adapted for the Turkish language through continued pretraining on a Turkish web corpus.
- Sequential CPT: This model is the culmination of a four-stage sequential CPT process, where each stage trained on a disjoint shard of the dataset.
- Cumulative Data Exposure: By the end of Stage 4, the model has been cumulatively exposed to 200,000 samples from the canbingol/vngrs-web-corpus-200k dataset.
- Training Objective: Continued pretraining for 1 epoch on samples 150,000–200,000, inheriting adaptations from prior stages.
- Token Count: This stage alone processed approximately 21.6 million tokens, contributing to a cumulative total of around 86.1 million tokens across all four stages.
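The token counts above imply an average sample length, which can be checked with quick arithmetic. The sketch below uses only the approximate figures stated in this card; the variable names are illustrative:

```python
# Back-of-the-envelope check of the token counts reported above.
# Figures are the approximate values from this card; actual per-stage
# counts vary slightly around the average.
STAGE4_SAMPLES = 50_000          # samples 150,000-200,000
STAGE4_TOKENS = 21_600_000       # ~21.6M tokens processed in Stage 4
CUMULATIVE_TOKENS = 86_100_000   # ~86.1M tokens across Stages 1-4

tokens_per_sample = STAGE4_TOKENS / STAGE4_SAMPLES
avg_stage_tokens_m = CUMULATIVE_TOKENS / 4 / 1e6

print(f"~{tokens_per_sample:.0f} tokens per sample in Stage 4")
print(f"~{avg_stage_tokens_m:.1f}M tokens per stage on average")
```

This works out to roughly 430 tokens per web-corpus sample, consistent with the per-stage and cumulative totals reported above.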
Training Lineage
This model's training lineage is a sequential progression:
- Stage 0: google/gemma-3-1b-pt
- Stage 1: Samples 0–50,000
- Stage 2: Samples 50,000–100,000
- Stage 3: Samples 100,000–150,000
- Stage 4 (this model): Samples 150,000–200,000, completing the first full epoch over the 200K-sample dataset.
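The shard boundaries above follow a simple pattern: four equal 50,000-sample slices of the 200K corpus. A minimal sketch of that mapping (the helper name `stage_range` is illustrative, not part of the actual training code):

```python
# Sketch: map a CPT stage number (1-4) to its sample-index shard,
# assuming the equal 50,000-sample shards described above.
TOTAL_SAMPLES = 200_000
NUM_STAGES = 4
SHARD_SIZE = TOTAL_SAMPLES // NUM_STAGES  # 50,000

def stage_range(stage: int) -> tuple[int, int]:
    """Return the (start, end) sample indices trained on in `stage`."""
    if not 1 <= stage <= NUM_STAGES:
        raise ValueError("stage must be between 1 and 4")
    return (stage - 1) * SHARD_SIZE, stage * SHARD_SIZE

print(stage_range(4))  # (150000, 200000) -- this model's shard
```

Because the shards are disjoint and cover the full index range, completing Stage 4 closes out exactly one epoch over the dataset.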
Use Cases
This model is suitable for applications requiring a compact, Turkish-centric language model, particularly for tasks benefiting from its domain adaptation to Turkish web content.