Model Overview
kabuizuchi-trading/dpo-qwen-cot-merged is a fine-tuned version of the kabuizuchi-trading/qwen3-4b-lora-structured base model. It was trained with Direct Preference Optimization (DPO), implemented via the Unsloth library, to align its responses with preferred outputs. The fine-tuning specifically targets improved Chain-of-Thought reasoning and higher-quality structured responses.
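For intuition, DPO minimizes a logistic loss over the policy-versus-reference log-probability margin between the chosen and rejected responses. The sketch below illustrates that objective with made-up log-probabilities; only beta=0.08 comes from this card's training configuration.

```python
# Illustration of the per-example DPO objective:
#   L = -log sigmoid(beta * [(pi_chosen - pi_rejected) - (ref_chosen - ref_rejected)])
# The log-probabilities below are invented for demonstration;
# beta=0.08 is the value reported in this model card.
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.08):
    """Per-example DPO loss: -log sigmoid(beta * margin difference)."""
    margin = (policy_chosen - policy_rejected) - (ref_chosen - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# The policy prefers the chosen response more strongly than the reference
# model does, so the loss falls below log(2), its value at zero margin.
loss = dpo_loss(policy_chosen=-12.0, policy_rejected=-20.0,
                ref_chosen=-14.0, ref_rejected=-18.0)
```

Training pushes this loss down by widening the policy's chosen-over-rejected margin relative to the reference model, with beta controlling how strongly deviations from the reference are rewarded.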
Key Features & Training Details
- Optimization Method: Direct Preference Optimization (DPO).
- Base Model: kabuizuchi-trading/qwen3-4b-lora-structured.
- Training Objective: Enhance reasoning capabilities and structured output quality based on a preference dataset (u-10bei/dpo-dataset-qwen-cot).
- Merged Weights: This repository provides the fully merged 16-bit weights, so no adapter loading is required.
- Training Configuration: Trained for 1 epoch with a learning rate of 3e-07, beta of 0.08, and a maximum sequence length of 2048. The LoRA adapters (r=8, alpha=16) were merged into the base model.
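As a sketch only, the hyperparameters above map onto an Unsloth + TRL DPO run roughly as follows. The hyperparameters and repository names come from this card; the LoRA target modules, dataset column names, and output paths are assumptions, not confirmed training details.

```python
# Sketch: reproducing the stated DPO setup with Unsloth + TRL.
# lr=3e-07, beta=0.08, 1 epoch, max_seq_length=2048, LoRA r=8/alpha=16
# are from this card; everything else is an illustrative assumption.
from unsloth import FastLanguageModel
from datasets import load_dataset
from trl import DPOConfig, DPOTrainer

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="kabuizuchi-trading/qwen3-4b-lora-structured",
    max_seq_length=2048,
)
model = FastLanguageModel.get_peft_model(model, r=8, lora_alpha=16)

dataset = load_dataset("u-10bei/dpo-dataset-qwen-cot", split="train")

trainer = DPOTrainer(
    model=model,
    args=DPOConfig(
        beta=0.08,
        learning_rate=3e-7,
        num_train_epochs=1,
        max_length=2048,
        output_dir="dpo-qwen-cot",  # assumed path
    ),
    train_dataset=dataset,  # assumes "prompt"/"chosen"/"rejected" columns
    processing_class=tokenizer,
)
trainer.train()

# Merge the LoRA adapters and save fully merged 16-bit weights,
# matching the checkpoint layout published in this repository.
model.save_pretrained_merged("dpo-qwen-cot-merged", tokenizer,
                             save_method="merged_16bit")
```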
Usage
Because the weights are fully merged, this model can be loaded directly with the transformers library for inference; no adapter (PEFT) loading step is required. It is suited to tasks that benefit from improved reasoning and structured output generation.
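A minimal loading sketch, assuming a recent transformers release with Qwen3 support: the checkpoint loads like any standard causal LM. The prompt and generation settings below are illustrative assumptions, not recommended values from the training run.

```python
# Minimal inference sketch with transformers (merged weights, no PEFT needed).
# The prompt and max_new_tokens are illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "kabuizuchi-trading/dpo-qwen-cot-merged"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

messages = [
    {"role": "user",
     "content": "List three prime numbers and explain your reasoning step by step."}
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=512)
# Decode only the newly generated tokens, dropping the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```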
License
The model and its training data are distributed under the MIT License. Users must also adhere to the original base model's license terms.