Model Overview
This model, sfutenma/dpo-qwen3_4b-cot-merged_v260302-010243, is a 4-billion-parameter language model built on the unsloth/Qwen3-4B-Instruct-2507 base. It went through a two-stage fine-tuning process, Supervised Fine-Tuning (SFT) followed by Direct Preference Optimization (DPO), both using the Unsloth library. The weights are fully merged in 16-bit, so no adapter loading is required.
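Because the weights are fully merged, the model can be loaded directly. A minimal usage sketch, assuming the standard Hugging Face `transformers` API (the repo ID is taken from this card; the dtype, device, and prompt below are illustrative choices, not requirements):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "sfutenma/dpo-qwen3_4b-cot-merged_v260302-010243"

# Merged 16-bit weights: load directly, no PEFT adapter attach step needed.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="bfloat16",   # illustrative; pick what your hardware supports
    device_map="auto",
)

messages = [{"role": "user", "content": 'Return {"status": "ok"} as JSON.'}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```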
Key Capabilities
- Enhanced Structured Output: The SFT phase specifically targeted improving the accuracy of structured outputs such as JSON, YAML, XML, TOML, and CSV formats. Loss was applied only to the final assistant output, masking intermediate reasoning during this stage.
- Improved Reasoning (Chain-of-Thought): Following SFT, DPO was applied to align the model's responses with preferred outputs, with a strong focus on enhancing its Chain-of-Thought (CoT) reasoning abilities.
- Optimized Alignment: DPO training with a beta of 0.2 and a learning rate of 1e-06 over 5 epochs further refined the model's responses, improving both reasoning quality and adherence to the desired structured-output formats.
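Since the model may emit chain-of-thought before its final structured answer, downstream code should validate only the final output. A minimal sketch of that pattern using only the standard library (the fence-matching regex and the sample reply are illustrative assumptions, not part of this model's output contract):

```python
import json
import re

def extract_final_json(response: str):
    """Return the last fenced JSON block in a response, parsed, or None."""
    blocks = re.findall(r"```json\s*(.*?)```", response, flags=re.DOTALL)
    if not blocks:
        return None
    try:
        return json.loads(blocks[-1])
    except json.JSONDecodeError:
        return None

# Hypothetical model reply: reasoning first, structured answer last.
reply = (
    "Let me think step by step about the required fields...\n"
    '```json\n{"status": "ok", "retries": 3}\n```'
)
print(extract_final_json(reply))  # → {'status': 'ok', 'retries': 3}
```

Taking the last block rather than the first mirrors the SFT setup above, where loss was applied only to the final assistant output.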
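For reference, the DPO objective behind the beta=0.2 setting above can be sketched in plain Python; the log-probability values below are made-up illustrations, not measurements from this model:

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.2):
    """Standard DPO loss: -log sigmoid(beta * (chosen log-ratio - rejected log-ratio))."""
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# No preference signal yet: both log-ratios are equal, loss is log(2).
print(round(dpo_loss(-10.0, -10.0, -10.0, -10.0), 3))  # → 0.693

# Policy favors the chosen response more than the reference does: loss drops.
print(round(dpo_loss(-8.0, -12.0, -10.0, -10.0), 3))  # → 0.371
```

A smaller beta (such as the 0.2 used here) tolerates larger divergence from the reference model, trading tighter regularization for stronger preference fitting.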
Training Details
The SFT stage ran for 2 epochs at a learning rate of 5e-06 with a QLoRA (4-bit) configuration (r=128, alpha=128). The DPO stage used a LoRA configuration of r=64, alpha=64, which was subsequently merged into the base model. Both stages used a maximum sequence length of 768 tokens.
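A hedged sketch of how the hyperparameters above might be expressed with the `peft` and `trl` APIs; this illustrates the reported settings, not the exact training script (argument names follow recent `peft`/`trl` releases and may differ by version):

```python
from peft import LoraConfig
from trl import DPOConfig

# SFT stage: QLoRA (4-bit base) with r=128, alpha=128, per this card.
sft_lora = LoraConfig(r=128, lora_alpha=128, task_type="CAUSAL_LM")

# DPO stage: r=64/alpha=64 LoRA, beta=0.2, lr=1e-06, 5 epochs, 768-token sequences.
dpo_lora = LoraConfig(r=64, lora_alpha=64, task_type="CAUSAL_LM")
dpo_args = DPOConfig(
    beta=0.2,
    learning_rate=1e-6,
    num_train_epochs=5,
    max_length=768,
)
```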
Good For
This model is particularly well-suited for applications requiring:
- Generating accurate and complex structured data (e.g., API responses, configuration files).
- Tasks where logical reasoning and step-by-step thought processes are crucial.
- Scenarios demanding high-quality, aligned responses from a 4B parameter model.