Hi-Satoh/sft-base4-dpo-e2-qwen-cot-merged
Hi-Satoh/sft-base4-dpo-e2-qwen-cot-merged is a 4 billion parameter language model, fine-tuned from Qwen/Qwen3-4B-Instruct-2507. Developed by Hi-Satoh, this model utilizes Direct Preference Optimization (DPO) via Unsloth to enhance its reasoning and alignment capabilities. It is specifically optimized for tasks requiring improved reasoning, building upon domain-specific knowledge acquired during its Supervised Fine-Tuning (SFT) phase.
Overview
This model is a 4-billion-parameter language model fine-tuned from the Qwen/Qwen3-4B-Instruct-2507 base model. It underwent a multi-stage training process: Supervised Fine-Tuning (SFT) followed by Direct Preference Optimization (DPO) applied with the Unsloth library. The primary objective of this fine-tuning was to improve the model's reasoning and alignment capabilities.
Key Capabilities
- Enhanced Reasoning: The model's DPO phase specifically targeted improvements in reasoning, building on domain-specific knowledge from its SFT phase.
- Optimized Alignment: DPO training contributes to better alignment with desired outputs and user preferences.
- Full-Merged Weights: This repository provides the full-merged 16-bit weights, eliminating the need for adapter loading and simplifying deployment.
- Qwen3-4B-Instruct-2507 Base: Leverages the robust architecture and initial capabilities of the Qwen3-4B-Instruct-2507 model.
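Because the repository ships full-merged 16-bit weights, the model can be loaded directly with the standard Hugging Face Transformers API, with no PEFT/adapter step. A minimal sketch (assumes `transformers` and `torch` are installed; dtype and device placement are illustrative choices, not requirements):

```python
# Minimal loading sketch for the full-merged checkpoint.
MODEL_ID = "Hi-Satoh/sft-base4-dpo-e2-qwen-cot-merged"

def load_model():
    # Imported inside the function so the constant above is usable
    # even where transformers/torch are not installed.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        torch_dtype=torch.bfloat16,  # weights are 16-bit; bf16 avoids an upcast
        device_map="auto",           # spread across available GPUs/CPU
    )
    return tokenizer, model

if __name__ == "__main__":
    tokenizer, model = load_model()
```

Since the weights are already merged, no `peft.PeftModel.from_pretrained` call is needed; the checkpoint behaves like any ordinary causal LM.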
Training Details
The model's training involved:
- SFT Phase: Initial training on the u-10bei/structured_data_with_cot_dataset_512_v5 dataset to acquire domain-specific knowledge.
- DPO Phase: Direct Preference Optimization applied for 2 epochs with a learning rate of 1e-7 and a beta of 0.1, using u-10bei/dpo-dataset-qwen-cot as the training data. The maximum sequence length for DPO was 2048 tokens.
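To make the role of the beta hyperparameter concrete, the per-pair DPO objective can be sketched in plain Python. This is an illustrative implementation of the standard DPO loss, -log sigmoid(beta * (policy margin - reference margin)); the function name and example log-probabilities are invented for demonstration:

```python
import math

def dpo_pair_loss(policy_logp_chosen: float,
                  policy_logp_rejected: float,
                  ref_logp_chosen: float,
                  ref_logp_rejected: float,
                  beta: float = 0.1) -> float:
    """Per-pair DPO loss: -log sigmoid(beta * (margin_policy - margin_ref)).

    Each argument is the summed token log-probability of the chosen or
    rejected response under the policy or the frozen reference model.
    """
    margin = ((policy_logp_chosen - ref_logp_chosen)
              - (policy_logp_rejected - ref_logp_rejected))
    # Numerically: -log(sigmoid(x)) == log(1 + exp(-x))
    return math.log1p(math.exp(-beta * margin))

# If the policy has not moved from the reference, the margin is 0 and the
# loss is log(2); as the policy prefers chosen over rejected more strongly
# than the reference does, the loss falls toward 0.
```

With beta = 0.1 (the value used here), large margins are needed before the loss saturates, so the policy is only gently pushed away from the reference model.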
Good For
- Applications requiring a 4B parameter model with improved reasoning and alignment.
- Developers looking for a readily deployable Qwen3-4B variant with enhanced performance characteristics.
- Use cases benefiting from a model fine-tuned with DPO for better preference alignment.