jinkami07/dpo-qwen3-4b-r8-lr1e6-beta005-ep2-merged
The jinkami07/dpo-qwen3-4b-r8-lr1e6-beta005-ep2-merged model is a 4 billion parameter Qwen3-based instruction-tuned language model, fine-tuned using Direct Preference Optimization (DPO) by jinkami07. It is specifically optimized to improve reasoning capabilities, particularly Chain-of-Thought, and enhance structured response quality. This model excels at generating aligned and coherent outputs for complex prompts, making it suitable for tasks requiring logical progression and structured answers.
Loading preview...
Model Overview
This model, jinkami07/dpo-qwen3-4b-r8-lr1e6-beta005-ep2-merged, is a 4 billion parameter language model based on the Qwen3-4B-Instruct-2507 architecture. It has been fine-tuned using Direct Preference Optimization (DPO) via the Unsloth library, with its LoRA adapters (r=16, alpha=32) fully merged into the base model for direct use without additional adapter loading.
Key Capabilities
- Enhanced Reasoning: Optimized to improve Chain-of-Thought (CoT) reasoning, allowing for more logical and step-by-step problem-solving.
- Structured Response Quality: Fine-tuned to produce higher quality, more structured outputs based on preference datasets.
- DPO Alignment: Leverages DPO to align model responses with preferred human outputs, leading to more desirable and coherent generations.
Training Details
The model underwent 1 epoch of DPO training with a learning rate of 1e-06 and a beta value of 0.1. It utilized a maximum sequence length of 1024 tokens during training. The training data used was u-10bei/dpo-dataset-qwen-cot, which focuses on preference-based optimization.
Good For
- Applications requiring improved logical reasoning and structured output generation.
- Tasks where response alignment with human preferences is critical.
- Developers seeking a Qwen3-based model with enhanced CoT capabilities.