Model Overview
The rk611/dpo-qwen-cot-merged model is a 4-billion-parameter language model based on the Qwen/Qwen3-4B-Instruct-2507 architecture. It was fine-tuned with Direct Preference Optimization (DPO) via the Unsloth library, and its 16-bit weights are fully merged into the base model, so it loads as a standalone checkpoint with no separate adapter-loading step.
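Because the weights are merged, the checkpoint can be loaded like any standalone transformers model. The sketch below assumes the standard transformers chat-template API; the generation settings are illustrative, not part of the release.

```python
# Hypothetical usage sketch: load the merged checkpoint directly --
# no PEFT adapter step is needed. Generation settings are illustrative.
def generate_response(prompt: str, model_id: str = "rk611/dpo-qwen-cot-merged") -> str:
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype="auto", device_map="auto"
    )

    # Qwen3 instruct models expect chat-formatted input.
    messages = [{"role": "user", "content": prompt}]
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)

    output_ids = model.generate(input_ids, max_new_tokens=1024)
    # Decode only the newly generated tokens, not the prompt.
    return tokenizer.decode(
        output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True
    )
```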
Key Capabilities
- Enhanced Reasoning: Optimized to improve Chain-of-Thought (CoT) reasoning, enabling more logical and structured problem-solving.
- Aligned Responses: DPO training aligns the model's outputs with preferred responses, leading to higher quality and more relevant generations.
- Structured Output: Focuses on generating well-structured and coherent responses based on the preference dataset used during training.
Training Details
The model underwent 1 epoch of DPO training with a learning rate of 1e-07 and a beta value of 0.1, using a maximum sequence length of 1024 tokens. Training used the u-10bei/dpo-dataset-qwen-cot preference dataset. The model is released under the MIT License, consistent with the base model's license terms.
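For reference, the DPO objective that this training procedure optimizes can be sketched in plain PyTorch. The beta value of 0.1 matches the configuration above; the helper name and the dummy log-probabilities are illustrative, not taken from the actual training run.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO objective: -log sigmoid(beta * (policy log-ratio - reference log-ratio)).

    Each argument is the summed log-probability of a chosen/rejected
    response under the policy or the frozen reference model.
    """
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()

# Dummy log-probabilities: the policy prefers the chosen response
# slightly more than the reference model does, so the loss drops
# below the no-preference baseline of log(2) ~= 0.693.
pc, pr = torch.tensor([-10.0]), torch.tensor([-12.0])
rc, rr = torch.tensor([-11.0]), torch.tensor([-11.0])
loss = dpo_loss(pc, pr, rc, rr, beta=0.1)
```

A small beta, as used here, keeps the policy close to the reference model while still rewarding the preferred responses.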
Good For
- Applications requiring improved logical reasoning and step-by-step thought processes.
- Use cases where response quality and alignment with specific preferences are critical.
- Generating structured and coherent text outputs.