canbingol/gemma3_1B_base-tr-cpt-only_3rd_stage_data
canbingol/gemma3_1B_base-tr-cpt-only_3rd_stage_data is a 1-billion-parameter Gemma-3-1B variant, continued-pretrained on a single subset (samples 100,000–150,000) of the Turkish web corpus vngrs-web-corpus-200k. It was developed by canbingol to isolate and evaluate the standalone impact of that specific data shard, supporting research into data ordering effects and incremental adaptation in continued pretraining. The goal is to understand how a single segment of Turkish web data influences model performance when no prior-stage adaptation has been applied.
Model Overview
This model, canbingol/gemma3_1B_base-tr-cpt-only_3rd_stage_data, is a 1-billion-parameter Gemma-3-1B variant that has undergone Continued Pretraining (CPT) exclusively on a specific subset of a Turkish web corpus. Unlike a typical multi-stage CPT pipeline, this model was trained solely on samples 100,000–150,000 of the canbingol/vngrs-web-corpus-200k dataset, without any prior-stage adaptation.
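The checkpoint can be loaded like any other causal language model from the Hugging Face Hub. The sketch below is illustrative only: it assumes the repository ships a standard Transformers-format checkpoint and that a transformers release with Gemma-3 support is installed; the prompt and sampling settings are arbitrary.

```python
# Minimal loading sketch (assumes a standard Transformers-format checkpoint
# and a transformers version with Gemma-3 support).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "canbingol/gemma3_1B_base-tr-cpt-only_3rd_stage_data"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # bf16 keeps the 1B model comfortably on a single GPU
    device_map="auto",
)

# Plain continuation sampling; this is a base (non-instruct) model, so no chat template.
prompt = "Türkiye'nin en kalabalık şehri"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```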
Key Characteristics
- Base Model: google/gemma-3-1b-pt
- Training Method: Standard continued pretraining with full model updates (no LoRA or other parameter-efficient methods).
- Dataset Focus: Trained only on the third shard (samples 100K–150K) of a Turkish web corpus, comprising approximately 21.6 million tokens over one epoch (see the shard-selection sketch after this list).
- Research Objective: Designed to isolate and measure the effect of this specific data shard, enabling analysis of data ordering effects and incremental vs. isolated adaptation.
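For reference, the data slice described above can be reconstructed by selecting the third 50K-example shard from the published dataset. This is a hedged sketch: the split name ("train") and any column names are assumptions not confirmed by this card.

```python
# Illustrative reconstruction of the data shard used for this run:
# samples 100,000–150,000 of canbingol/vngrs-web-corpus-200k.
# The split name ("train") is an assumption.
from datasets import load_dataset

dataset = load_dataset("canbingol/vngrs-web-corpus-200k", split="train")

# Select the third 50K-example shard (indices 100,000 to 149,999).
third_shard = dataset.select(range(100_000, 150_000))
print(third_shard)
```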
Intended Use Cases
This model is primarily intended for research and experimental purposes, specifically for:
- Comparative Analysis: Evaluating the isolated impact of a single data shard against models trained on other individual shards or sequential multi-stage CPT models.
- Understanding Data Effects: Investigating how specific segments of a corpus influence model sensitivity and adaptation.
- Experimental Benchmarking: Serving as a controlled baseline for experiments related to continued pretraining strategies and data regime effects in Turkish language models.
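As one concrete example of such a controlled comparison, the sketch below scores this checkpoint against the untouched google/gemma-3-1b-pt base by token-level perplexity on a Turkish sentence. It is illustrative only; the sample text is a stand-in, and a real evaluation would use a proper held-out Turkish corpus.

```python
# Hedged sketch of a shard-vs-base comparison: token-level perplexity of this
# checkpoint versus the google/gemma-3-1b-pt base on sample Turkish text.
# The evaluation text below is a placeholder, not a real benchmark.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perplexity(model_id: str, text: str) -> float:
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
    model.eval()
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # Passing labels=input_ids returns the mean cross-entropy loss over tokens.
        loss = model(**inputs, labels=inputs["input_ids"]).loss
    return math.exp(loss.item())

sample = "Yapay zekâ modelleri, büyük miktarda metin üzerinde eğitilir."
for model_id in (
    "canbingol/gemma3_1B_base-tr-cpt-only_3rd_stage_data",
    "google/gemma-3-1b-pt",
):
    print(model_id, perplexity(model_id, sample))
```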