moogician/DSR1-Qwen-32B-still
Overview
moogician/DSR1-Qwen-32B-still is a 32-billion-parameter language model derived from deepseek-ai/DeepSeek-R1-Distill-Qwen-32B. Its distinguishing feature is fine-tuning on the "still" dataset, which suggests a specialization toward that dataset's domain, although the card does not describe the data itself. The model supports a context length of 32,768 tokens, allowing it to process lengthy texts and complex queries.
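The advertised context window can be checked directly from the model's configuration without downloading the weights. A minimal sketch, assuming the repository name above and the standard transformers AutoConfig API:

```python
from transformers import AutoConfig

# Fetch only the config (no weights) to inspect the context window.
config = AutoConfig.from_pretrained("moogician/DSR1-Qwen-32B-still")

# Qwen2-based models expose the context length as max_position_embeddings;
# per this card the usable context is 32,768 tokens.
print(config.max_position_embeddings)
```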
Training Details
The model was trained with a learning rate of 1e-05 for 17 epochs on a multi-GPU setup with 8 devices. Key hyperparameters included a per-device train_batch_size of 2 and gradient_accumulation_steps of 6, which together with the 8 GPUs yields a total_train_batch_size of 96 (2 × 6 × 8). The optimizer was ADAMW_TORCH with default betas and epsilon, paired with a cosine learning-rate scheduler and a 0.1 warmup ratio. The training environment used Transformers 4.49.0, PyTorch 2.5.1+cu124, Datasets 3.2.0, and Tokenizers 0.21.0.
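For reference, the reported hyperparameters map onto the standard transformers TrainingArguments roughly as follows. This is a hedged reconstruction rather than the author's actual training script; the output directory and anything else not stated in the card are placeholders:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="dsr1-qwen-32b-still",   # placeholder, not from the card
    learning_rate=1e-5,
    num_train_epochs=17,
    per_device_train_batch_size=2,      # train_batch_size
    gradient_accumulation_steps=6,
    # Across 8 GPUs: 2 * 6 * 8 = 96 total train batch size.
    optim="adamw_torch",                # ADAMW_TORCH with default betas/epsilon
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
)
```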
Intended Use
The card does not detail specific intended uses or limitations. Given the fine-tuning on a single dataset, the model may perform best on tasks within that dataset's domain; users should evaluate it against their own applications, particularly those that depend on the large context window.
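Absent an official usage snippet, a minimal generation example along the usual transformers lines is sketched below. The chat-template call assumes the tokenizer ships a chat template (the DeepSeek-R1 distills do), and the sampling settings are illustrative rather than author-recommended:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "moogician/DSR1-Qwen-32B-still"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "Prove that the sum of two even integers is even."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Reasoning-tuned models often emit long chains of thought, so allow a
# generous completion budget; max_new_tokens here is illustrative.
outputs = model.generate(inputs, max_new_tokens=1024, do_sample=True, temperature=0.6)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```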