MasatoNishimura/dpo-qwen-cot-merged
MasatoNishimura/dpo-qwen-cot-merged is a 4-billion-parameter language model fine-tuned from Qwen/Qwen3-4B-Instruct-2507 using Direct Preference Optimization (DPO) via Unsloth. The model is optimized to strengthen reasoning, particularly Chain-of-Thought (CoT), and to improve the quality of structured responses. It supports a 32,768-token context length and is intended for applications that require strong logical coherence and well-structured output.
Model Overview
This model, dpo-qwen-cot-merged, is a 4-billion-parameter language model derived from Qwen/Qwen3-4B-Instruct-2507. It was trained with Direct Preference Optimization (DPO) using the Unsloth library, specifically targeting improvements in reasoning and structured response generation.
Key Capabilities
- Enhanced Reasoning (Chain-of-Thought): Optimized to produce more coherent and logical reasoning steps in its outputs.
- Improved Structured Responses: Fine-tuned to generate higher quality, well-organized structured answers based on preference datasets.
- Full-Merged Weights: The repository provides fully merged 16-bit weights, eliminating the need for adapter loading (see the quick check after this list).
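Because the weights are already merged, the checkpoint loads like any ordinary causal LM, and the advertised context window can be verified from the model config. A minimal sketch, using only standard transformers API (the expected value reflects the 32,768-token context length stated above):

```python
# Quick check that the merged weights load directly (no PEFT adapter step)
# and that the advertised context window is present in the config.
from transformers import AutoConfig, AutoModelForCausalLM

repo_id = "MasatoNishimura/dpo-qwen-cot-merged"

config = AutoConfig.from_pretrained(repo_id)
print(config.max_position_embeddings)  # expected: 32768, per this card

# Merged 16-bit weights load like any ordinary causal LM checkpoint:
model = AutoModelForCausalLM.from_pretrained(repo_id, torch_dtype="auto")
```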
Training Details
- Methodology: DPO (Direct Preference Optimization), applied for 1 epoch.
- Configuration: Training used a learning rate of 1e-07, a DPO beta of 0.1, and a maximum sequence length of 1024.
- Base Model: Qwen/Qwen3-4B-Instruct-2507.
- Training Data: The model was trained on the u-10bei/dpo-dataset-qwen-cot dataset (a minimal reproduction sketch follows this list).
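For orientation, the sketch below wires the hyperparameters listed above into Unsloth and TRL's DPOTrainer. It is an illustrative approximation rather than the authors' exact script: the LoRA settings, batch size, and the assumption that u-10bei/dpo-dataset-qwen-cot uses standard prompt/chosen/rejected columns are all guesses, and a recent TRL version is assumed.

```python
# Minimal DPO training sketch with Unsloth + TRL. Hyperparameters
# (lr=1e-07, beta=0.1, max_length=1024, 1 epoch) come from this card;
# everything else (LoRA rank, target modules, batch size) is a placeholder.
from unsloth import FastLanguageModel
from datasets import load_dataset
from trl import DPOConfig, DPOTrainer

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen3-4B-Instruct-2507",
    max_seq_length=1024,
)
# Assumed LoRA setup; the card does not document the adapter config.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
)

dataset = load_dataset("u-10bei/dpo-dataset-qwen-cot", split="train")

args = DPOConfig(
    output_dir="dpo-qwen-cot",
    num_train_epochs=1,
    learning_rate=1e-07,
    beta=0.1,
    max_length=1024,
    per_device_train_batch_size=2,  # assumed
)

trainer = DPOTrainer(
    model=model,
    args=args,
    train_dataset=dataset,
    processing_class=tokenizer,
)
trainer.train()

# Merge the LoRA adapter into the base weights and save in 16-bit,
# which is how a merged checkpoint like this repository's is produced.
model.save_pretrained_merged("dpo-qwen-cot-merged", tokenizer,
                             save_method="merged_16bit")
```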
Usage
Because the weights are merged, this model can be used directly with the transformers library for inference; no adapter loading step is required. Users should adhere to the MIT License of the training data and to the original base model's license terms.
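A minimal inference sketch with transformers follows. The chat-template call is the standard pattern for Qwen instruct models; the prompt and generation settings are illustrative, not recommendations from this card.

```python
# Minimal inference sketch using the merged weights with transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "MasatoNishimura/dpo-qwen-cot-merged"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(
    repo_id, torch_dtype="auto", device_map="auto"
)

# Illustrative reasoning prompt to exercise the CoT behavior.
messages = [
    {"role": "user", "content": "If a train travels 60 km in 45 minutes, "
                                "what is its average speed in km/h? "
                                "Think step by step."},
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=512)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```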