ko3377/dpo-qwen-cot-merged16
The ko3377/dpo-qwen-cot-merged16 model is a 4 billion parameter instruction-tuned causal language model based on the Qwen3 architecture, fine-tuned by ko3377. It utilizes Direct Preference Optimization (DPO) via Unsloth to enhance reasoning capabilities, specifically Chain-of-Thought (CoT), and improve structured response quality. This model is designed for tasks requiring robust reasoning and well-structured outputs, offering a 32768 token context length.
Loading preview...
Model Overview
This model, ko3377/dpo-qwen-cot-merged16, is a 4 billion parameter language model derived from the Qwen/Qwen3-4B-Instruct-2507 base model. It has been fine-tuned by ko3377 using Direct Preference Optimization (DPO) with the Unsloth library. The key objective of this optimization was to align the model's responses with preferred outputs, significantly improving its reasoning capabilities (Chain-of-Thought) and the overall quality of structured responses.
Key Features & Training Details
- Base Model: Qwen/Qwen3-4B-Instruct-2507
- Fine-tuning Method: Direct Preference Optimization (DPO)
- Optimization Focus: Enhanced reasoning (Chain-of-Thought) and structured output quality.
- Merged Weights: This repository provides the full-merged 16-bit weights, eliminating the need for adapter loading during deployment.
- Training Data: Utilized datasets such as
u-10bei/structured_data_with_cot_dataset_512_v5andu-10bei/dpo-dataset-qwen-cot. - Context Length: Supports a maximum sequence length of 1152 during training, with the base model supporting up to 32768 tokens.
Recommended Use Cases
This model is particularly well-suited for applications requiring:
- Improved Reasoning: Tasks that benefit from explicit step-by-step thought processes.
- Structured Output Generation: Scenarios where well-formatted and coherent responses are critical.
- Instruction Following: General instruction-tuned tasks where high-quality, aligned outputs are desired.