huseyinatahaninan/appworld_distillation_sft-SFT-Qwen3-4B-Instruct-2507
This model is a 4-billion-parameter Qwen3-Instruct variant with a 40,960-token context length, fine-tuned by huseyinatahaninan. It was adapted from Qwen/Qwen3-4B-Instruct-2507 through supervised fine-tuning (SFT) on the appworld_distillation_sft dataset and reached a final validation loss of 0.2588, reflecting its specialization toward that dataset's tasks.
Model Overview
This model, appworld_distillation_sft-SFT-Qwen3-4B-Instruct-2507, is a 4 billion parameter instruction-tuned variant of the Qwen3 architecture, developed by huseyinatahaninan. It is a supervised fine-tuned (SFT) version of the base model Qwen/Qwen3-4B-Instruct-2507, specifically trained on the appworld_distillation_sft dataset.
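The model should work with the standard transformers text-generation workflow. The snippet below is a minimal sketch, assuming a recent transformers release with Qwen3 support (plus accelerate for device placement); the example prompt is illustrative only.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "huseyinatahaninan/appworld_distillation_sft-SFT-Qwen3-4B-Instruct-2507"

# Load the tokenizer and model; device_map="auto" (requires accelerate)
# spreads the weights across available GPUs.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

# Qwen3-Instruct models are chat models, so build the prompt via the
# chat template rather than raw text.
messages = [{"role": "user", "content": "Summarize what supervised fine-tuning does."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(input_ids, max_new_tokens=256)
# Decode only the newly generated tokens, not the echoed prompt.
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```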
Training Details
The model was trained for 10 epochs with a learning rate of 5e-06 and a total batch size of 32 across 8 GPUs, using the adamw_torch optimizer and a cosine learning rate scheduler with a 0.1 warmup ratio. Validation loss decreased steadily over training, reaching a final value of 0.2588.
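The configuration below is a minimal sketch of how these reported hyperparameters map onto Hugging Face TrainingArguments, assuming a recent transformers release. The per-device batch size of 4 (4 × 8 GPUs = 32 total), the output path, and the evaluation cadence are assumptions for illustration, not values published for the original run.

```python
from transformers import TrainingArguments

# Hedged reconstruction of the reported setup; the per-device/GPU split
# and output_dir are assumptions, not published values.
training_args = TrainingArguments(
    output_dir="./appworld_distillation_sft-checkpoints",  # hypothetical path
    num_train_epochs=10,
    learning_rate=5e-6,
    per_device_train_batch_size=4,  # 4 per GPU x 8 GPUs = total batch size 32
    optim="adamw_torch",
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    eval_strategy="epoch",          # evaluation cadence is an assumption
    logging_steps=10,
)
```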
Key Characteristics
- Base Model: Qwen3-4B-Instruct-2507
- Parameter Count: 4 billion
- Context Length: 40960 tokens
- Fine-tuning Dataset: appworld_distillation_sft
- Evaluation Loss: 0.2588
Potential Use Cases
Given its fine-tuning on the appworld_distillation_sft dataset, this model is likely strongest on tasks that match that dataset's domain and data distribution. Developers should evaluate it against their own workloads before relying on it outside that distribution, since the specialized training may not transfer to general-purpose use.