follmon10/qwen3-4b-dpo-qwen-cot-merged_v1

Text generation · Model size: 4B · Quantization: BF16 · Context length: 32k · Published: Mar 1, 2026 · License: apache-2.0 · Architecture: Transformer · Open weights

follmon10/qwen3-4b-dpo-qwen-cot-merged_v1 is a 4 billion parameter language model fine-tuned from Qwen/Qwen3-4B-Instruct-2507. It was trained with Direct Preference Optimization (DPO) to strengthen Chain-of-Thought (CoT) reasoning and structured response quality, aligning its outputs with preferred examples. This makes it suitable for tasks that require clear logical flow and well-organized answers.


Model Overview

follmon10/qwen3-4b-dpo-qwen-cot-merged_v1 is a 4 billion parameter language model derived from the Qwen/Qwen3-4B-Instruct-2507 base model. It has undergone fine-tuning using Direct Preference Optimization (DPO) via the Unsloth library, resulting in a merged 16-bit weight model that requires no adapter loading.

Key Capabilities

  • Enhanced Reasoning: DPO training targets Chain-of-Thought (CoT) reasoning, producing more logical, step-by-step responses.
  • Preference Alignment: Outputs are aligned with the preferred responses in the training data, favoring higher-quality generations.
  • Structured Response Quality: Generates well-organized, coherent answers reflecting the structure of the preference dataset.
  • Direct Usage: Because the weights are fully merged, the model can be loaded directly with the transformers library; no PEFT adapter is needed.
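Since the merged weights load like any other causal LM checkpoint, inference follows the standard transformers pattern. The sketch below is untested against this exact checkpoint and assumes a recent transformers version plus enough memory for BF16 weights; the prompt is illustrative only.

```python
# Minimal inference sketch (assumption: standard Qwen3 chat template
# and a recent `transformers` release; requires downloading the weights).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "follmon10/qwen3-4b-dpo-qwen-cot-merged_v1"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # card lists BF16 weights
    device_map="auto",
)

messages = [
    {"role": "user", "content": "Explain step by step why 17 is prime."}
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=512)
# Decode only the newly generated tokens, not the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```

Because no adapter loading is involved, the same snippet works with any transformers-compatible serving stack (e.g. text-generation pipelines) that accepts a Hugging Face model ID.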

Training Details

The model was trained for 1 epoch with a learning rate of 1e-07, a DPO beta of 0.1, and a maximum sequence length of 1024 tokens. Training used the u-10bei/dpo-dataset-qwen-cot preference dataset.
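The hyperparameters above can be expressed as a DPO training configuration. The sketch below uses the TRL DPOTrainer API as a stand-in; the card states Unsloth was used, so the exact invocation here is an assumption, and only the numeric values (1 epoch, lr 1e-07, beta 0.1, max length 1024) and the dataset ID come from the card.

```python
# Hedged reconstruction of the training setup (assumption: TRL's
# DPOTrainer; the card says Unsloth was the actual library used).
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base_id = "Qwen/Qwen3-4B-Instruct-2507"
model = AutoModelForCausalLM.from_pretrained(base_id)
tokenizer = AutoTokenizer.from_pretrained(base_id)

# Preference dataset named in the card (chosen/rejected response pairs).
dataset = load_dataset("u-10bei/dpo-dataset-qwen-cot", split="train")

config = DPOConfig(
    output_dir="qwen3-4b-dpo-cot",
    num_train_epochs=1,     # 1 epoch (from the card)
    learning_rate=1e-7,     # lr 1e-07 (from the card)
    beta=0.1,               # DPO beta (from the card)
    max_length=1024,        # max sequence length (from the card)
)

trainer = DPOTrainer(
    model=model,
    args=config,
    train_dataset=dataset,
    processing_class=tokenizer,
)
trainer.train()
```

After training, merging the (optionally LoRA-based) weights back into the base model in 16-bit yields the adapter-free checkpoint published here.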

Good For

  • Applications requiring improved reasoning and logical flow in generated text.
  • Tasks where structured and aligned responses are critical.
  • Developers seeking a compact 4B parameter model with enhanced CoT capabilities.