NobutaMN/dpo-qwen-cot-merged is a 4-billion-parameter language model fine-tuned by NobutaMN with Direct Preference Optimization (DPO) on a Qwen3-4B-Instruct base. The model targets stronger reasoning, particularly Chain-of-Thought (CoT), and higher-quality structured responses. It is designed to align more closely with preferred outputs on complex reasoning tasks, making it suitable for applications that require coherent, well-structured logical deductions.
Overview
NobutaMN/dpo-qwen-cot-merged is a 4-billion-parameter language model fine-tuned with Direct Preference Optimization (DPO) via the Unsloth library. It builds on the NobutaMN/qwen3-4b-structevalt-lora-nobuta-v2-3change base, with a primary focus on improving Chain-of-Thought reasoning and structured response generation.
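For orientation, below is a minimal sketch of what a DPO fine-tuning run with Unsloth and TRL's `DPOTrainer` typically looks like. The learning rate, epoch count, and sequence length come from this card; the dataset name, LoRA rank/alpha, and batch settings are illustrative placeholders, not the actual training configuration. A recent TRL version (with `DPOConfig` and `processing_class`) is assumed.

```python
from unsloth import FastLanguageModel
from trl import DPOConfig, DPOTrainer
from datasets import load_dataset

# Load the base model (4-bit quantization to fit on a single GPU).
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen3-4B-Instruct-2507",
    max_seq_length=1024,
    load_in_4bit=True,
)

# Attach LoRA adapters; rank and alpha here are illustrative defaults.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# Preference dataset with "prompt", "chosen", "rejected" columns
# (placeholder name -- substitute the actual CoT preference data).
dataset = load_dataset("your-username/cot-preference-pairs", split="train")

trainer = DPOTrainer(
    model=model,
    ref_model=None,  # with a PEFT model, the adapter-disabled base serves as reference
    args=DPOConfig(
        learning_rate=1e-7,   # as stated in this card
        num_train_epochs=1,   # as stated in this card
        max_length=1024,
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
        output_dir="dpo-qwen-cot",
    ),
    train_dataset=dataset,
    processing_class=tokenizer,
)
trainer.train()
```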
Key Capabilities
- Enhanced Reasoning: Optimized to produce more coherent and logical Chain-of-Thought reasoning.
- Improved Structured Responses: Fine-tuned to generate higher quality, well-structured outputs based on preference data.
- DPO Alignment: Leverages Direct Preference Optimization to align model behavior with desired response patterns (the objective is shown below).
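For reference, the standard DPO objective (Rafailov et al., 2023) trains the policy $\pi_\theta$ to prefer the chosen response $y_w$ over the rejected response $y_l$ relative to a frozen reference model $\pi_{\text{ref}}$:

$$\mathcal{L}_{\text{DPO}} = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]$$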
Good For
- Applications requiring robust logical deduction and step-by-step reasoning.
- Tasks where structured and consistent output formats are critical.
- Use cases benefiting from models aligned with human preferences for response quality.
This repository provides LoRA adapter weights only; they must be loaded on top of the base model (Qwen/Qwen3-4B-Instruct-2507), as in the sketch below. Training used a learning rate of 1e-7 for 1 epoch, with a maximum sequence length of 1,024 tokens.
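A minimal loading sketch, assuming the standard `transformers` + `peft` path for LoRA adapters; the model identifiers are taken from this card, and the prompt is only a smoke test.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model the adapter was trained against.
base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-4B-Instruct-2507",
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-4B-Instruct-2507")

# Apply the DPO LoRA adapter from this repository on top of the base.
model = PeftModel.from_pretrained(base, "NobutaMN/dpo-qwen-cot-merged")

# Quick smoke test: ask for step-by-step reasoning.
messages = [{
    "role": "user",
    "content": "If a train travels 60 km in 45 minutes, what is its "
               "average speed in km/h? Think step by step.",
}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```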