TSerizawa/llm-lecture-2025_dpo-qwen-cot-merged_base_model

Text Generation · Model size: 4B · Precision: BF16 · Context length: 32k · Published: Feb 3, 2026 · License: apache-2.0 · Architecture: Transformer · Open weights

TSerizawa/llm-lecture-2025_dpo-qwen-cot-merged_base_model is a fine-tuned version of Qwen/Qwen3-4B-Instruct-2507, trained with Direct Preference Optimization (DPO) using Unsloth. This 4-billion-parameter model aims to improve reasoning through Chain-of-Thought (CoT) and to produce better-structured responses. It is intended for applications that require aligned, high-quality outputs on reasoning tasks.


Model Overview

This model, TSerizawa/llm-lecture-2025_dpo-qwen-cot-merged_base_model, is a specialized variant of the Qwen/Qwen3-4B-Instruct-2507 base model. It has been fine-tuned using Direct Preference Optimization (DPO), leveraging the Unsloth library, to align its responses with preferred outputs.

Key Capabilities & Optimization

  • Enhanced Reasoning: The primary objective of this DPO fine-tuning was to improve the model's reasoning abilities, particularly through Chain-of-Thought (CoT) processes.
  • Structured Response Quality: It is optimized to produce higher quality and more structured responses based on a preference dataset.
  • Full-Merged Weights: The repository provides the full-merged 16-bit weights, eliminating the need for adapter loading and simplifying deployment.
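Because the adapter has already been merged, the model can be loaded like any standard checkpoint. A minimal deployment sketch using the `transformers` library (the helper name `load_merged_model` is illustrative, not part of the repository):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "TSerizawa/llm-lecture-2025_dpo-qwen-cot-merged_base_model"

def load_merged_model(model_id: str = MODEL_ID):
    """Load the full-merged BF16 weights directly; no PEFT/adapter step is needed."""
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,  # the repository publishes 16-bit weights
        device_map="auto",           # place layers on available accelerators
    )
    return tokenizer, model
```

This is the same loading path as for the base Qwen3-4B-Instruct model, which is the practical benefit of shipping merged weights.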

Training Details

  • Methodology: DPO was applied over 1 epoch with a learning rate of 1e-07 and a beta value of 0.1.
  • Context Length: The training utilized a maximum sequence length of 1024 tokens.
  • LoRA Configuration: LoRA (r=8, alpha=16) was used during training and subsequently merged into the base model.
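For reference, the per-pair DPO objective these hyperparameters feed into can be sketched in plain Python; beta=0.1 matches the value above, while the example log-probabilities are illustrative:

```python
import math

def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Per-pair DPO loss: -log sigmoid(beta * (chosen log-ratio - rejected log-ratio))."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(x)) rewritten in a numerically stable form: log(1 + exp(-x))
    return math.log1p(math.exp(-margin))

# When the policy favors the chosen response more than the reference model does,
# the margin is positive and the loss drops below log(2) ≈ 0.693.
print(dpo_loss(-10.0, -12.0, -11.0, -11.0))  # → ≈ 0.598
```

The small beta (0.1) and low learning rate (1e-07) keep the fine-tuned policy close to the reference model while still rewarding preferred responses.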

Intended Use Cases

This model is particularly well-suited for applications where:

  • Improved Reasoning is critical, especially for tasks benefiting from Chain-of-Thought prompting.
  • High-Quality, Aligned Outputs are required, reflecting preferred response styles and structures.
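Qwen instruct models use the ChatML conversation format. A minimal sketch of assembling a CoT-style prompt by hand (the system instruction and question are illustrative; in practice, prefer `tokenizer.apply_chat_template`, which applies the model's own template):

```python
def build_chatml_prompt(messages: list[dict]) -> str:
    """Render a message list into ChatML and append the assistant
    header so the model generates the next turn."""
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>" for m in messages]
    parts.append("<|im_start|>assistant\n")
    return "\n".join(parts)

messages = [
    {"role": "system", "content": "You are a helpful assistant. Reason step by step."},
    {"role": "user", "content": "A train travels 60 km in 45 minutes. What is its average speed in km/h?"},
]
prompt = build_chatml_prompt(messages)
print(prompt)
```

The "reason step by step" system instruction elicits the Chain-of-Thought behavior this DPO fine-tune was optimized for.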

Licensing

The model's training data is sourced from u-10bei/dpo-dataset-qwen-cot under an MIT License. Users must also comply with the original base model's license terms.