## Model Overview
This model, ko3377/dpo-qwen-cot-merged16, is a 4-billion-parameter language model derived from the Qwen/Qwen3-4B-Instruct-2507 base model. It was fine-tuned by ko3377 using Direct Preference Optimization (DPO) with the Unsloth library. The goal of this optimization was to align the model's responses with preferred outputs, improving its Chain-of-Thought reasoning and the overall quality of structured responses.
## Key Features & Training Details
- Base Model: Qwen/Qwen3-4B-Instruct-2507
- Fine-tuning Method: Direct Preference Optimization (DPO)
- Optimization Focus: Enhanced reasoning (Chain-of-Thought) and structured output quality.
- Merged Weights: This repository provides the fully merged 16-bit weights, so no adapter loading is needed at deployment time.
- Training Data: Utilized datasets such as u-10bei/structured_data_with_cot_dataset_512_v5 and u-10bei/dpo-dataset-qwen-cot.
- Context Length: Trained with a maximum sequence length of 1152; the base model supports up to 32768 tokens.
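Because the repository ships fully merged 16-bit weights, the model can be loaded directly with the standard `transformers` API, with no PEFT/LoRA adapter step. The sketch below is a minimal, assumed usage example: the model id comes from this card, but the generation settings and the `build_messages` helper are illustrative assumptions, not part of the release.

```python
MODEL_ID = "ko3377/dpo-qwen-cot-merged16"  # repository name from this card


def build_messages(question: str) -> list[dict]:
    """Wrap a user question in the chat-message format used by Qwen instruct models."""
    return [{"role": "user", "content": question}]


def generate(question: str, max_new_tokens: int = 512) -> str:
    """Hypothetical inference sketch; settings (dtype, device_map, token budget) are assumptions."""
    # Imported lazily so the prompt helper above works without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    # Merged 16-bit weights: load directly, no adapter attachment required.
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype="auto", device_map="auto"
    )
    prompt = tokenizer.apply_chat_template(
        build_messages(question), tokenize=False, add_generation_prompt=True
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens, skipping the echoed prompt.
    new_tokens = output_ids[0][inputs["input_ids"].shape[-1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

For short factual queries a smaller `max_new_tokens` is usually enough; for Chain-of-Thought tasks, leave headroom for the intermediate reasoning steps.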
## Recommended Use Cases
This model is particularly well-suited for applications requiring:
- Improved Reasoning: Tasks that benefit from explicit step-by-step thought processes.
- Structured Output Generation: Scenarios where well-formatted and coherent responses are critical.
- Instruction Following: General instruction-tuned tasks where high-quality, aligned outputs are desired.
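The first two use cases can be combined in a single prompt: ask the model to reason step by step and then emit output in a fixed structure. The helper below is a hypothetical illustration of that pattern; the function name and prompt wording are assumptions, not an API shipped with the model.

```python
import json


def make_structured_prompt(task: str, schema: dict) -> str:
    """Build a prompt that requests step-by-step reasoning followed by JSON output.

    Illustrative only: the exact phrasing is an assumption, not a prescribed
    template for this model.
    """
    return (
        f"{task}\n\n"
        "Think through the problem step by step, then give your final answer "
        "as a JSON object matching this schema:\n"
        f"{json.dumps(schema, indent=2)}"
    )
```

A prompt built this way plays to the model's DPO training focus: the "step by step" instruction elicits the Chain-of-Thought behavior, while the explicit schema anchors the structured final answer.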