deepkick/qwen3-4b-struct-dpo-v11-merged
deepkick/qwen3-4b-struct-dpo-v11-merged is a 4-billion-parameter language model, fine-tuned from Qwen/Qwen3-4B-Instruct-2507 using Direct Preference Optimization (DPO) via Unsloth. The model is optimized for structured response stability and schema adherence, making it suitable for tasks that require precise output formats. It ships as full-merged 16-bit weights and inherits a 40,960-token context length, supporting complex, structured data generation.
Overview
This model, deepkick/qwen3-4b-struct-dpo-v11-merged, is a 4-billion-parameter language model derived from Qwen/Qwen3-4B-Instruct-2507. It has been fine-tuned using Direct Preference Optimization (DPO) with the Unsloth library to improve its ability to produce stable, schema-compliant structured responses.
Key Capabilities
- Structured Response Generation: Optimized specifically for generating outputs that adhere to predefined structures and schemas.
- DPO Fine-tuning: Leverages Direct Preference Optimization for improved response quality based on preference data.
- Merged Weights: Provided as full-merged 16-bit weights, eliminating the need for adapter loading and simplifying deployment.
- Base Model Context: Inherits the 40,960-token context length from its Qwen3-4B-Instruct base model.
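Because the weights are fully merged, the model can be loaded like any standard causal LM in Hugging Face transformers, with no PEFT/adapter step. The sketch below is illustrative, not an official usage snippet from the repo: the loading function assumes transformers is installed and enough memory for a 4B model, and the prompt-building helper and its schema string are hypothetical examples of how one might request schema-constrained output.

```python
# Hedged sketch: loading the full-merged 16-bit weights (no adapter loading
# needed). The function is defined but not called here, since it downloads
# a 4B-parameter model.
def load_merged_model(model_id: str = "deepkick/qwen3-4b-struct-dpo-v11-merged"):
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype="auto",   # picks up the 16-bit dtype stored in the repo
        device_map="auto",
    )
    return model, tokenizer


# Hypothetical helper: a chat-style request asking for output that matches
# a given schema, the kind of task this model is tuned for.
def build_messages(schema: str, task: str) -> list[dict]:
    return [
        {"role": "system",
         "content": f"Respond only with JSON that matches this schema: {schema}"},
        {"role": "user", "content": task},
    ]
```

The messages list can then be passed through `tokenizer.apply_chat_template(...)` before generation, as with any Qwen3 chat model.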
Training Details
The model was trained for 1 epoch with a learning rate of 2e-7, a DPO beta of 0.05, and a maximum sequence length of 1536 tokens. Training used the u-10bei/structured_data_with_cot_dataset_512_v2 dataset, which is licensed under the MIT License.
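The reported hyperparameters can be written out in the shape that TRL-style DPO training arguments take; this is a reconstruction from the numbers above, not the actual training script, and the argument names follow common TRL/Unsloth conventions.

```python
# Hedged sketch: the DPO hyperparameters stated in this card, expressed as a
# plain dict using standard TRL-style argument names (the exact training
# script is an assumption, only the values come from the card).
dpo_hparams = {
    "num_train_epochs": 1,
    "learning_rate": 2e-7,
    "beta": 0.05,        # DPO preference-strength coefficient
    "max_length": 1536,  # maximum sequence length during training
}
```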
Good For
- Applications requiring reliable structured data output.
- Tasks where schema adherence is critical.
- Developers looking for a Qwen3-based model with enhanced response format consistency.
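For the use cases above, it is common to validate model replies against the expected structure before consuming them downstream. The minimal checker below is an illustrative simplification (a "schema" reduced to a set of required top-level keys), not part of the model or its tooling.

```python
import json

# Hedged sketch: a post-hoc check that a model reply is valid JSON containing
# the required top-level keys -- the kind of validation this model's
# structured-output tuning is meant to make pass reliably.
def validate_response(reply: str, required_keys: set[str]) -> bool:
    try:
        data = json.loads(reply)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and required_keys <= data.keys()
```

In production, a full JSON Schema validator (e.g. the `jsonschema` package) would replace this required-keys check.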