STRV/dpo-qwen-cot-merged
STRV/dpo-qwen-cot-merged is a 4 billion parameter Qwen3-based causal language model fine-tuned by STRV using Direct Preference Optimization (DPO). This model is specifically optimized for improving reasoning capabilities, particularly Chain-of-Thought (CoT), and generating high-quality structured responses. It leverages a 32K context length and is designed for applications requiring enhanced logical processing and coherent output generation.
Loading preview...
Model Overview
STRV/dpo-qwen-cot-merged is a 4 billion parameter language model derived from Qwen/Qwen3-4B-Instruct-2507. It has been fine-tuned using Direct Preference Optimization (DPO) via the Unsloth library, with its 16-bit weights fully merged for direct use without adapter loading.
Key Optimizations
This model's primary objective is to enhance its reasoning capabilities, specifically focusing on Chain-of-Thought (CoT) processes, and to improve the quality of structured responses. This optimization was achieved through DPO training on a preference dataset (u-10bei/dpo-dataset-qwen-cot) over one epoch.
Technical Details
- Base Model: Qwen/Qwen3-4B-Instruct-2507
- Fine-tuning Method: DPO
- Max Sequence Length: 1024 (during training)
- License: MIT License (derived from the dataset terms), with compliance to the original base model's license terms.
Usage
As a merged model, it can be directly loaded and used with the transformers library for inference, supporting a 32K context length.