masachika/qwen3-4b-dpo-cot-merged
The masachika/qwen3-4b-dpo-cot-merged model is a 4 billion parameter Qwen3-based language model fine-tuned for improved reasoning and structured output generation. It underwent a two-stage fine-tuning process, first with Supervised Fine-Tuning (SFT) on structured output datasets (JSON, YAML, XML, TOML, CSV), followed by Direct Preference Optimization (DPO) for alignment and enhanced reasoning quality. This model is designed to provide aligned responses and generate structured data formats effectively, making it suitable for tasks requiring precise output formatting and logical coherence.
Loading preview...
Model Overview
The masachika/qwen3-4b-dpo-cot-merged is a 4 billion parameter language model built upon the Qwen3-4B-Instruct-2507 base. It has been meticulously fine-tuned through a two-stage process to enhance its capabilities in generating structured outputs and improving reasoning.
Key Capabilities
- Structured Output Generation: Initially fine-tuned (SFT) to produce various structured data formats, including JSON, YAML, XML, TOML, and CSV.
- Improved Reasoning and Alignment: Further optimized using Direct Preference Optimization (DPO) with a specialized dataset (
u-10bei/dpo-dataset-qwen-cot) to align responses with preferred outputs and boost reasoning quality. - Full-Merged Weights: This repository provides the full-merged 16-bit weights, eliminating the need for adapter loading and simplifying deployment.
Training Details
The model's development involved:
- Stage 1 (SFT): Supervised Fine-Tuning on
Qwen/Qwen3-4B-Instruct-2507usingmasachika/qwen3-4b-Instruct-2507-structured-output-lorato teach structured output generation. - Stage 2 (DPO): Direct Preference Optimization on the SFT-merged model, focusing on aligning responses and improving reasoning over 2 epochs with a learning rate of 3e-07 and a max sequence length of 2048.
Good For
- Applications requiring precise, structured data output (e.g., API response generation, configuration file creation).
- Tasks benefiting from enhanced reasoning and aligned, high-quality responses.
- Developers seeking a readily deployable 4B parameter model with specialized fine-tuning for structured generation and improved coherence.