AlainGuillotin/dpo-qwen-cot-merged

- **Task:** Text generation
- **Model size:** 4B parameters
- **Quantization:** BF16
- **Context length:** 32k tokens
- **Published:** Mar 1, 2026
- **License:** apache-2.0
- **Architecture:** Transformer (open weights)

AlainGuillotin/dpo-qwen-cot-merged is a 4-billion-parameter Qwen3-based causal language model fine-tuned by AlainGuillotin. It uses Direct Preference Optimization (DPO) to improve Chain-of-Thought reasoning and structured response quality. With a 32768-token context length, the model is optimized for generating coherent outputs aligned with preference data, making it suitable for tasks that require clear logical flow and structured text generation.


Model Overview

This model, AlainGuillotin/dpo-qwen-cot-merged, is a fine-tuned version of the Qwen/Qwen3-4B-Instruct-2507 base model. It has been optimized using Direct Preference Optimization (DPO) via the Unsloth library to align its responses with preferred outputs.

Key Capabilities

  • Enhanced Reasoning: Specifically trained to improve Chain-of-Thought (CoT) reasoning abilities.
  • Structured Response Quality: Focuses on generating more structured and coherent outputs.
  • Fully Merged Weights: The repository contains the fully merged 16-bit weights, eliminating the need for adapter loading.

Training Details

  • Method: Direct Preference Optimization (DPO).
  • Base Model: Qwen/Qwen3-4B-Instruct-2507.
  • Dataset: Trained on the u-10bei/dpo-dataset-qwen-cot preference dataset.
  • Configuration: Trained for 1 epoch with a learning rate of 1e-07 and a beta of 0.1. The maximum sequence length used during training was 1024.
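The training setup above can be sketched with TRL's `DPOTrainer` API. This is a hedged approximation: the card states Unsloth was used (which wraps a TRL-compatible trainer), and only the hyperparameters, dataset, and base model names come from the card; everything else (output paths, the exact trainer call) is an assumption.

```python
# Hyperparameters taken from the model card; the rest is illustrative.
HYPERPARAMS = {
    "beta": 0.1,             # DPO preference-strength coefficient
    "learning_rate": 1e-7,
    "num_train_epochs": 1,
    "max_length": 1024,      # max sequence length used during training
}

def run_dpo_training() -> None:
    # Imports kept local so the hyperparameter block above stands alone.
    from datasets import load_dataset
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from trl import DPOConfig, DPOTrainer

    base = "Qwen/Qwen3-4B-Instruct-2507"
    model = AutoModelForCausalLM.from_pretrained(base)
    tokenizer = AutoTokenizer.from_pretrained(base)
    train_data = load_dataset("u-10bei/dpo-dataset-qwen-cot", split="train")

    config = DPOConfig(output_dir="dpo-qwen-cot", **HYPERPARAMS)
    trainer = DPOTrainer(
        model=model,
        args=config,
        train_dataset=train_data,
        processing_class=tokenizer,
    )
    trainer.train()
    trainer.save_model("dpo-qwen-cot")  # assumed output path

# Call run_dpo_training() to launch the run (requires a GPU with enough memory).
```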

Usage

Because the repository ships merged weights, this model can be loaded directly with the transformers library for inference, with no adapter step required. Users should adhere to the repository's Apache-2.0 license and the original base model's license terms.
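A minimal inference sketch follows, assuming a standard transformers chat-template workflow; the model ID comes from this card, while the prompt, dtype, and generation settings are illustrative assumptions.

```python
MODEL_ID = "AlainGuillotin/dpo-qwen-cot-merged"

def generate_response(prompt: str, max_new_tokens: int = 512) -> str:
    """Load the merged weights and generate a reply to a single user message."""
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        torch_dtype="bfloat16",  # matches the BF16 weights in the repo
        device_map="auto",
    )

    messages = [{"role": "user", "content": prompt}]
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output_ids = model.generate(input_ids, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens, not the prompt.
    return tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True)

# Example (downloads ~8 GB of weights on first use):
# print(generate_response("Explain, step by step, why the sky appears blue."))
```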