STRV/dpo-qwen-cot-merged

Hugging Face
TEXT GENERATIONConcurrency Cost:1Model Size:4BQuant:BF16Ctx Length:32kPublished:Mar 2, 2026License:apache-2.0Architecture:Transformer Open Weights Warm

STRV/dpo-qwen-cot-merged is a 4 billion parameter Qwen3-based causal language model fine-tuned by STRV using Direct Preference Optimization (DPO). This model is specifically optimized for improving reasoning capabilities, particularly Chain-of-Thought (CoT), and generating high-quality structured responses. It leverages a 32K context length and is designed for applications requiring enhanced logical processing and coherent output generation.

Loading preview...

Model Overview

STRV/dpo-qwen-cot-merged is a 4 billion parameter language model derived from Qwen/Qwen3-4B-Instruct-2507. It has been fine-tuned using Direct Preference Optimization (DPO) via the Unsloth library, with its 16-bit weights fully merged for direct use without adapter loading.

Key Optimizations

This model's primary objective is to enhance its reasoning capabilities, specifically focusing on Chain-of-Thought (CoT) processes, and to improve the quality of structured responses. This optimization was achieved through DPO training on a preference dataset (u-10bei/dpo-dataset-qwen-cot) over one epoch.

Technical Details

  • Base Model: Qwen/Qwen3-4B-Instruct-2507
  • Fine-tuning Method: DPO
  • Max Sequence Length: 1024 (during training)
  • License: MIT License (derived from the dataset terms), with compliance to the original base model's license terms.

Usage

As a merged model, it can be directly loaded and used with the transformers library for inference, supporting a 32K context length.