yuerxin/DeepSeek-R1-Distill-Qwen-1.5B
Text Generation · Concurrency Cost: 1 · Model Size: 1.5B · Quant: BF16 · Ctx Length: 32k · Published: Sep 22, 2025 · License: MIT · Architecture: Transformer · Open Weights

DeepSeek-R1-Distill-Qwen-1.5B is a fine-tuned language model based on deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B. It was trained with a learning rate of 1e-05 for 3 epochs using a cosine learning-rate scheduler. Specific capabilities and intended uses are not documented, but the base model is a 1.5-billion-parameter distilled version of DeepSeek-R1, suggesting a focus on efficient performance on general language tasks.
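The training setup above (base learning rate 1e-05 decayed over 3 epochs with a cosine scheduler) can be sketched as a small standalone function. This is a minimal illustration of the standard cosine-decay formula, not the exact schedule used for this fine-tune; the `min_lr` floor and the absence of a warmup phase are assumptions.

```python
import math

def cosine_lr(step, total_steps, base_lr=1e-5, min_lr=0.0):
    """Cosine-decayed learning rate: starts at base_lr, ends at min_lr.

    base_lr matches the 1e-05 reported for this fine-tune;
    min_lr=0.0 and the lack of warmup are assumptions.
    """
    progress = step / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))

# At step 0 the rate equals base_lr; halfway through it is base_lr / 2;
# at the final step it has decayed to min_lr.
print(cosine_lr(0, 1000))     # 1e-05
print(cosine_lr(500, 1000))   # 5e-06
print(cosine_lr(1000, 1000))  # ~0.0
```

In practice a trainer would call this once per optimizer step, with `total_steps` equal to steps-per-epoch times 3 epochs.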
