canbingol/gemma3_1B_base-tr-cpt-only_3rd_stage_data
canbingol/gemma3_1B_base-tr-cpt-only_3rd_stage_data is a 1-billion-parameter Gemma-3-1B variant, continued-pretrained on a single subset (samples 100,000–150,000) of the Turkish web corpus vngrs-web-corpus-200k. It was developed by canbingol to isolate and evaluate the standalone impact of that specific data shard, supporting research into data ordering effects and incremental adaptation in continued pretraining. The goal is to understand how a single segment of Turkish web data influences model performance when no prior-stage adaptation has been applied.
Model Overview
This model, canbingol/gemma3_1B_base-tr-cpt-only_3rd_stage_data, is a 1-billion-parameter Gemma-3-1B variant that has undergone Continued Pretraining (CPT) exclusively on a specific subset of a Turkish web corpus. Unlike a typical multi-stage CPT pipeline, this model was trained solely on samples 100,000–150,000 of the canbingol/vngrs-web-corpus-200k dataset, without any prior-stage adaptation.
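The checkpoint can be loaded like any other causal language model from the Hugging Face Hub. The sketch below is illustrative only: it assumes the repository ships a standard Transformers-format checkpoint and that a transformers release with Gemma-3 support is installed; the prompt and sampling settings are arbitrary.

```python
# Minimal loading sketch (assumes a standard Transformers-format checkpoint
# and a transformers version with Gemma-3 support).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "canbingol/gemma3_1B_base-tr-cpt-only_3rd_stage_data"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # bf16 keeps the 1B model comfortably on a single GPU
    device_map="auto",
)

# Plain continuation sampling; this is a base (non-instruct) model, so no chat template.
prompt = "Türkiye'nin en kalabalık şehri"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```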
Key Characteristics
- Base Model: google/gemma-3-1b-pt
- Training Method: Standard continued pretraining with full model updates (no LoRA or other parameter-efficient methods).
- Dataset Focus: Trained only on the third shard (samples 100K–150K) of a Turkish web corpus, comprising approximately 21.6 million tokens over one epoch (see the shard-selection sketch after this list).
- Research Objective: Designed to isolate and measure the effect of this specific data shard, enabling analysis of data ordering effects and incremental vs. isolated adaptation.
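For reference, the data slice described above can be reconstructed by selecting the third 50K-example shard from the published dataset. This is a hedged sketch: the split name ("train") and any column names are assumptions not confirmed by this card.

```python
# Illustrative reconstruction of the data shard used for this run:
# samples 100,000–150,000 of canbingol/vngrs-web-corpus-200k.
# The split name ("train") is an assumption.
from datasets import load_dataset

dataset = load_dataset("canbingol/vngrs-web-corpus-200k", split="train")

# Select the third 50K-example shard (indices 100,000 to 149,999).
third_shard = dataset.select(range(100_000, 150_000))
print(third_shard)
```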
Intended Use Cases
This model is primarily intended for research and experimental purposes, specifically for:
- Comparative Analysis: Evaluating the isolated impact of a single data shard against models trained on other individual shards or sequential multi-stage CPT models.
- Understanding Data Effects: Investigating how specific segments of a corpus influence model sensitivity and adaptation.
- Experimental Benchmarking: Serving as a controlled baseline for experiments related to continued pretraining strategies and data regime effects in Turkish language models.
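As one concrete example of such a controlled comparison, the sketch below scores this checkpoint against the untouched google/gemma-3-1b-pt base by token-level perplexity on a Turkish sentence. It is illustrative only; the sample text is a stand-in, and a real evaluation would use a proper held-out Turkish corpus.

```python
# Hedged sketch of a shard-vs-base comparison: token-level perplexity of this
# checkpoint versus the google/gemma-3-1b-pt base on sample Turkish text.
# The evaluation text below is a placeholder, not a real benchmark.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perplexity(model_id: str, text: str) -> float:
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
    model.eval()
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # Passing labels=input_ids returns the mean cross-entropy loss over tokens.
        loss = model(**inputs, labels=inputs["input_ids"]).loss
    return math.exp(loss.item())

sample = "Yapay zekâ modelleri, büyük miktarda metin üzerinde eğitilir."
for model_id in (
    "canbingol/gemma3_1B_base-tr-cpt-only_3rd_stage_data",
    "google/gemma-3-1b-pt",
):
    print(model_id, perplexity(model_id, sample))
```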