canbingol/gemma3_1B_base-tr-cpt-only_4th_stage_data
The canbingol/gemma3_1B_base-tr-cpt-only_4th_stage_data model is a 1-billion-parameter Gemma-3-1B variant developed by canbingol, continued-pretrained exclusively on a 50,000-sample subset (samples 150,000–200,000) of a Turkish web corpus. By isolating the impact of a single data shard, it is well suited to research on data ordering effects and incremental adaptation in continued pretraining, and it is designed for comparative analysis against other CPT models to gauge the model's sensitivity to particular corpus segments.
Overview
This model, canbingol/gemma3_1B_base-tr-cpt-only_4th_stage_data, is a 1 billion parameter Gemma-3-1B variant that has undergone Turkish Continued Pretraining (CPT). Unlike typical multi-stage CPT, this model was trained exclusively on a specific subset of the Turkish web corpus (samples 150,000–200,000 from the fourth shard) to isolate and measure the effect of this particular data segment.
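For context, the data regime can be reproduced with the `datasets` library. A minimal sketch, assuming the corpus is exposed as a single `train` split indexed in the same order used for sharding:

```python
from datasets import load_dataset

# Load the 200k-sample Turkish web corpus (split name "train" is an assumption).
corpus = load_dataset("canbingol/vngrs-web-corpus-200k", split="train")

# The fourth shard: samples 150,000–200,000, the only data this model saw during CPT.
fourth_stage = corpus.select(range(150_000, 200_000))
print(len(fourth_stage))  # 50000
```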
Key Characteristics
- Base Model: google/gemma-3-1b-pt
- Training Method: Standard continued pretraining with full model updates (no LoRA).
- Dataset: A 50,000-sample subset of the canbingol/vngrs-web-corpus-200k dataset.
- Objective: To evaluate the standalone impact of a specific stage of Turkish web data, independent of prior adaptations.
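Since the checkpoint is a standard full-weight causal LM (no adapters to merge), it should load like any other Hub model. A minimal sketch; the dtype choice and the Turkish prompt are illustrative assumptions:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "canbingol/gemma3_1B_base-tr-cpt-only_4th_stage_data"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # assumed dtype; adjust to your hardware
    device_map="auto",
)

# Base-model-style completion (no chat template): Turkish prompt in, continuation out.
prompt = "Türkiye'nin en kalabalık şehri"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```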
Intended Use Cases
This model is primarily a research tool designed for controlled comparisons and analysis, specifically to investigate:
- The effects of data ordering in continued pretraining.
- Differences between incremental and isolated model adaptation.
- The sensitivity of the Gemma-3-1B architecture to specific segments of a Turkish corpus.
It is intended for comparison against other CPT models, including stage-specific and sequential multi-stage variants, to understand how different data regimes influence model performance and characteristics. Detailed experimental results are available in a Google Sheet.
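One straightforward way to run such a comparison is held-out perplexity on Turkish text. A minimal sketch; the evaluation sentences are placeholders, and any alternative checkpoint IDs you compare against are your own choices, not part of this release:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perplexity(model_id: str, texts: list[str]) -> float:
    """Mean perplexity of a causal LM over a list of texts."""
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
    model.eval()
    losses = []
    for text in texts:
        enc = tokenizer(text, return_tensors="pt")
        with torch.no_grad():
            # Labels == inputs gives the standard next-token cross-entropy loss.
            losses.append(model(**enc, labels=enc["input_ids"]).loss)
    return torch.exp(torch.stack(losses).mean()).item()

# Held-out Turkish sentences (placeholders; substitute your own evaluation set).
eval_texts = [
    "Ankara, Türkiye'nin başkentidir.",
    "İstanbul Boğazı, Asya ile Avrupa kıtalarını birbirinden ayırır.",
]
print(perplexity("canbingol/gemma3_1B_base-tr-cpt-only_4th_stage_data", eval_texts))
```

Running the same function over the other stage-specific and multi-stage variants gives a like-for-like view of how each data regime affects the model.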