sfutenma/dpo-qwen3_4b-cot-merged_v260301-220140
The sfutenma/dpo-qwen3_4b-cot-merged_v260301-220140 model is a 4 billion parameter language model, fine-tuned from sfutenma/lora_structeval_t_qwen3_4b_v260228-172650 using Direct Preference Optimization (DPO) with the Unsloth library. This model is specifically optimized for improving reasoning capabilities through Chain-of-Thought (CoT) and generating high-quality structured responses. It is designed for applications requiring aligned and coherent outputs based on preferred data, supporting a 32768 token context length.
Loading preview...
Model Overview
This model, sfutenma/dpo-qwen3_4b-cot-merged_v260301-220140, is a 4 billion parameter language model derived from sfutenma/lora_structeval_t_qwen3_4b_v260228-172650. It has undergone Direct Preference Optimization (DPO) using the Unsloth library, specifically targeting enhanced reasoning and structured response generation.
Key Capabilities
- Improved Reasoning: Optimized for Chain-of-Thought (CoT) reasoning, aligning responses with preferred outputs.
- Structured Response Quality: Enhanced ability to produce high-quality, structured answers based on preference datasets.
- DPO Fine-tuning: Leverages DPO for better alignment and coherence in generated text.
- Merged Weights: Provides full-merged 16-bit weights, eliminating the need for adapter loading and simplifying deployment with
transformers.
Training Details
The model was trained for 5 epochs with a learning rate of 2e-05 and a beta value of 0.03. It utilized a maximum sequence length of 768 during training and incorporated LoRA with r=8 and alpha=16, which has been merged into the base model. The training data used was u-10bei/dpo-dataset-qwen-cot.
Good For
- Applications requiring models with strong reasoning capabilities.
- Generating structured and aligned text outputs.
- Use cases where direct preference optimization leads to desired response quality.
License
The model is released under the MIT License, consistent with its training dataset. Users must also adhere to the original base model's license terms.