Model Overview
This model, demimomi/dpo-qwen-cot-merged, is a 4-billion-parameter language model based on the Qwen3-4B-Instruct-2507 architecture. It was fine-tuned by demimomi with Direct Preference Optimization (DPO) via the Unsloth library. The fine-tuning aims to align the model's responses with preferred outputs, with a particular focus on improving Chain-of-Thought reasoning and the quality of structured responses.
Key Features & Training
- Base Model: Qwen/Qwen3-4B-Instruct-2507.
- Optimization Method: Direct Preference Optimization (DPO) for aligning model behavior with desired outputs.
- Training Objective: Enhanced reasoning (Chain-of-Thought) and improved structured output generation.
- Training Configuration: Trained for 2 epochs with a learning rate of 1e-06 and a beta of 0.05. The maximum sequence length used during training was 1536 tokens.
- Deployment: This repository provides the fully merged 16-bit weights, so no separate adapter loading is required.
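The reported hyperparameters could be reproduced with a setup along these lines. This is a hypothetical sketch using TRL's DPOTrainer, not the author's actual Unsloth training script (which is not published in this card); the output directory and the preference dataset are placeholders.

```python
# Hypothetical DPO training sketch matching the reported hyperparameters:
# 2 epochs, learning rate 1e-6, beta 0.05, max sequence length 1536.
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_id = "Qwen/Qwen3-4B-Instruct-2507"  # base model from this card
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

config = DPOConfig(
    output_dir="dpo-qwen-cot",  # placeholder output directory
    num_train_epochs=2,         # reported: 2 epochs
    learning_rate=1e-6,         # reported: 1e-06
    beta=0.05,                  # DPO beta, reported: 0.05
    max_length=1536,            # reported max sequence length
)

# preference_dataset is a placeholder; a DPO dataset needs
# "prompt", "chosen", and "rejected" columns.
trainer = DPOTrainer(
    model=model,
    args=config,
    train_dataset=preference_dataset,
    processing_class=tokenizer,
)
trainer.train()
```

After training, merging the adapter (or training in full precision, as here) and saving with `save_pretrained` would yield merged weights like those shipped in this repository.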
Usage
This model can be used directly with the Hugging Face transformers library for text generation tasks. It is particularly suitable for applications where coherent reasoning and well-structured outputs are critical.
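A minimal generation sketch with the transformers library is shown below. The repo id comes from this card; the dtype, device placement, and generation settings are illustrative choices, not requirements.

```python
# Sketch: loading the merged 16-bit weights with transformers. Because the
# DPO weights are already merged into the base model, no PEFT/adapter step
# is needed.
def generate(prompt: str, max_new_tokens: int = 512) -> str:
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "demimomi/dpo-qwen-cot-merged"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,  # 16-bit weights as shipped in the repo
        device_map="auto",
    )

    # Qwen3 instruct models expect a chat template; wrap the prompt as a user turn.
    messages = [{"role": "user", "content": prompt}]
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)

    output = model.generate(inputs, max_new_tokens=max_new_tokens)
    # Decode only the newly generated continuation, skipping the prompt tokens.
    return tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True)
```

For example, `generate("Solve step by step: 17 * 24")` would return the model's Chain-of-Thought style answer as a string.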
License
The model is released under the MIT License; users must also comply with the license terms of the original base model.