canbingol/gemma3_1B_base-tr-cpt-only_2nd_stage_data

Text Generation · Model Size: 1B · Quant: BF16 · Context Length: 32k · Published: Apr 4, 2026 · Architecture: Transformer

This model is a Gemma-3-1B variant developed by canbingol, specifically a Turkish Continued Pretraining (CPT) model. It was trained exclusively on the second shard (samples 50,000–100,000) of a Turkish web corpus, isolating the impact of that specific data segment. The model's primary purpose is experimental: it is designed to analyze the effects of incremental versus isolated data adaptation in continued pretraining for Turkish language tasks.


Model Overview

This model, canbingol/gemma3_1B_base-tr-cpt-only_2nd_stage_data, is a Turkish Continued Pretraining (CPT) variant of the google/gemma-3-1b-pt base model. Unlike typical multi-stage CPT, this version was trained only on the second shard (samples 50,000–100,000) of the canbingol/vngrs-web-corpus-200k dataset, with no prior first-stage adaptation. Skipping the first stage isolates the standalone impact of this specific data segment on the model's performance.
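
Since this is a base (pretrained) model rather than an instruction-tuned one, it is used as a plain causal language model. The snippet below is a minimal inference sketch with the transformers library; the prompt and generation settings are illustrative and not taken from the model card.

```python
# Minimal inference sketch; generation settings are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "canbingol/gemma3_1B_base-tr-cpt-only_2nd_stage_data"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # matches the BF16 precision listed above
    device_map="auto",
)

# Turkish prompt; as a base model, this produces raw continuations
# rather than instruction-following responses.
prompt = "Türkiye'nin en kalabalık şehri"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```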

Key Training Details

  • Base Model: google/gemma-3-1b-pt
  • Dataset: canbingol/vngrs-web-corpus-200k (samples 50,000–100,000 only)
  • Training Method: Standard continued pretraining (full model update, no LoRA)
  • Epochs: 1
  • Tokens Processed: Approximately 21.6 million
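
For orientation, a rough reconstruction of this setup is sketched below. Only the base model, the dataset slice, full-parameter updates, and the single epoch come from the card; the sequence length, batch size, and learning rate are placeholder assumptions, and the corpus is assumed to expose a "text" column.

```python
# Hedged reconstruction of the CPT run described above; every hyperparameter
# below is a placeholder assumption, not a value from the model card.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

base_id = "google/gemma-3-1b-pt"
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id)  # full model update, no LoRA

# Second shard only: samples 50,000-100,000 of the Turkish web corpus.
dataset = load_dataset("canbingol/vngrs-web-corpus-200k", split="train[50000:100000]")

def tokenize(batch):
    # Assumes a "text" column; adjust to the dataset's actual schema.
    return tokenizer(batch["text"], truncation=True, max_length=2048)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="gemma3-1b-tr-cpt-stage2-only",
        num_train_epochs=1,              # from the model card
        per_device_train_batch_size=4,   # placeholder
        learning_rate=2e-5,              # placeholder
        bf16=True,
    ),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```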

Intended Use and Research Focus

This model is primarily an experimental artifact, designed for researchers and developers interested in:

  • Analyzing data ordering effects in CPT.
  • Comparing incremental versus isolated data adaptation strategies.
  • Evaluating the sensitivity of models to specific corpus segments.

It serves as a controlled comparison point against Stage 1-only CPT models, sequential multi-stage CPT models, and LoRA-based CPT variants. Full experimental results are available in a Google Sheet.
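
As a hedged illustration of how such a comparison might be run, the snippet below computes held-out perplexity for the base model and this variant on a short Turkish sample; the evaluation text is illustrative and not part of the published experiments.

```python
# Illustrative perplexity comparison between the base model and this CPT
# variant; the sample text is an assumption, not part of the published results.
import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perplexity(model_id: str, text: str) -> float:
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
    model.eval()
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # Passing labels makes the model return the mean cross-entropy loss.
        loss = model(**inputs, labels=inputs["input_ids"]).loss
    return math.exp(loss.item())

sample = "İstanbul Boğazı, Karadeniz'i Marmara Denizi'ne bağlayan doğal bir su yoludur."
for mid in ["google/gemma-3-1b-pt",
            "canbingol/gemma3_1B_base-tr-cpt-only_2nd_stage_data"]:
    print(mid, perplexity(mid, sample))
```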