canbingol/gemma3_1B_base-tr-cpt-only_2nd_stage_data

Text Generation · Model Size: 1B · Quant: BF16 · Context Length: 32k · Published: Apr 4, 2026 · Architecture: Transformer

This model is a Gemma-3-1B variant developed by canbingol, specifically a Turkish Continued Pretraining (CPT) model. It was trained exclusively on the second shard (samples 50,000–100,000) of a Turkish web corpus, isolating the impact of that specific data segment. The model's primary purpose is experimental: it is designed to analyze the effects of incremental versus isolated data adaptation in continued pretraining for Turkish language tasks.


Model Overview

This model, canbingol/gemma3_1B_base-tr-cpt-only_2nd_stage_data, is a Turkish Continued Pretraining (CPT) variant of the google/gemma-3-1b-pt base model. Unlike typical multi-stage CPT, this version was trained only on the second shard (samples 50,000–100,000) of the canbingol/vngrs-web-corpus-200k dataset, with no prior first-stage adaptation. Skipping the first stage isolates the standalone impact of this specific data segment on the model's performance.
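
Since this is a base (pretrained) model rather than an instruction-tuned one, it is used as a plain causal language model. The snippet below is a minimal inference sketch with the transformers library; the prompt and generation settings are illustrative and not taken from the model card.

```python
# Minimal inference sketch; generation settings are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "canbingol/gemma3_1B_base-tr-cpt-only_2nd_stage_data"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # matches the BF16 precision listed above
    device_map="auto",
)

# Turkish prompt; as a base model, this produces raw continuations
# rather than instruction-following responses.
prompt = "Türkiye'nin en kalabalık şehri"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```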

Key Training Details

  • Base Model: google/gemma-3-1b-pt
  • Dataset: canbingol/vngrs-web-corpus-200k (samples 50,000–100,000 only)
  • Training Method: Standard continued pretraining (full model update, no LoRA)
  • Epochs: 1
  • Tokens Processed: Approximately 21.6 million
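
For orientation, a rough reconstruction of this setup is sketched below. Only the base model, the dataset slice, full-parameter updates, and the single epoch come from the card; the sequence length, batch size, and learning rate are placeholder assumptions, and the corpus is assumed to expose a "text" column.

```python
# Hedged reconstruction of the CPT run described above; every hyperparameter
# below is a placeholder assumption, not a value from the model card.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

base_id = "google/gemma-3-1b-pt"
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id)  # full model update, no LoRA

# Second shard only: samples 50,000-100,000 of the Turkish web corpus.
dataset = load_dataset("canbingol/vngrs-web-corpus-200k", split="train[50000:100000]")

def tokenize(batch):
    # Assumes a "text" column; adjust to the dataset's actual schema.
    return tokenizer(batch["text"], truncation=True, max_length=2048)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="gemma3-1b-tr-cpt-stage2-only",
        num_train_epochs=1,              # from the model card
        per_device_train_batch_size=4,   # placeholder
        learning_rate=2e-5,              # placeholder
        bf16=True,
    ),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```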

Intended Use and Research Focus

This model is primarily an experimental artifact, designed for researchers and developers interested in:

  • Analyzing data ordering effects in CPT.
  • Comparing incremental versus isolated data adaptation strategies.
  • Evaluating the sensitivity of models to specific corpus segments.

It serves as a controlled comparison point against Stage 1-only CPT models, sequential multi-stage CPT models, and LoRA-based CPT variants. Full experimental results are available in a Google Sheet.
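
As a hedged illustration of how such a comparison might be run, the snippet below computes held-out perplexity for the base model and this variant on a short Turkish sample; the evaluation text is illustrative and not part of the published experiments.

```python
# Illustrative perplexity comparison between the base model and this CPT
# variant; the sample text is an assumption, not part of the published results.
import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perplexity(model_id: str, text: str) -> float:
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
    model.eval()
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # Passing labels makes the model return the mean cross-entropy loss.
        loss = model(**inputs, labels=inputs["input_ids"]).loss
    return math.exp(loss.item())

sample = "İstanbul Boğazı, Karadeniz'i Marmara Denizi'ne bağlayan doğal bir su yoludur."
for mid in ["google/gemma-3-1b-pt",
            "canbingol/gemma3_1B_base-tr-cpt-only_2nd_stage_data"]:
    print(mid, perplexity(mid, sample))
```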