KawausoHiroKawauso/qwen3-4b-structeval-lora-36

Pipeline: Text Generation · Model size: 4B · Quantization: BF16 · Context length: 32k · Published: Feb 8, 2026 · License: apache-2.0 · Architecture: Transformer · Open weights

KawausoHiroKawauso/qwen3-4b-structeval-lora-36 is a 4-billion-parameter Qwen3-Instruct model, fine-tuned with Direct Preference Optimization (DPO) via Unsloth. It is optimized to improve reasoning capabilities, particularly Chain-of-Thought, and to enhance the quality of structured responses. It generates outputs aligned with a preference dataset, making it suitable for tasks that require precise, structured language generation.


Model Overview

This model, qwen3-4b-structeval-lora-36, is a 4 billion parameter variant of the Qwen3-Instruct architecture, specifically Qwen/Qwen3-4B-Instruct-2507. It has been fine-tuned using Direct Preference Optimization (DPO) via the Unsloth library, with its LoRA configuration (r=8, alpha=16) fully merged into the base model's 16-bit weights.
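For intuition about what "fully merged" means, a LoRA adapter with rank r and scaling alpha is folded into a base weight matrix as W = W0 + (alpha / r) · BA, after which no adapter weights remain at inference time. A toy pure-Python sketch of that merge (the matrices below are tiny hypothetical examples with rank 1, not the model's actual weights; the real configuration uses r=8, alpha=16, giving the same alpha/r scale of 2.0):

```python
# Toy illustration of merging a LoRA adapter into base weights:
# W_merged = W0 + (alpha / r) * (B @ A).

def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

def merge_lora(W0, A, B, r, alpha):
    """Fold the low-rank update B @ A into the base weight matrix W0."""
    scale = alpha / r  # 16 / 8 = 2.0 for this model's configuration
    BA = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W0, BA)]

# 2x2 base weight and a rank-1 adapter (hypothetical values).
W0 = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0], [0.0]]   # shape (2, r=1)
A = [[0.5, 0.5]]     # shape (r=1, 2)
merged = merge_lora(W0, A, B, r=1, alpha=2)
print(merged)  # [[2.0, 1.0], [0.0, 1.0]]
```

After merging, the model is stored and served as ordinary 16-bit dense weights, which is why this checkpoint loads like any other Qwen3-4B model rather than as a base-plus-adapter pair.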

Key Capabilities

  • Enhanced Reasoning: Optimized to improve Chain-of-Thought reasoning, leading to more logical and coherent outputs.
  • Structured Response Quality: Fine-tuned to produce higher quality structured responses, aligning with preferred output formats.
  • DPO Alignment: Leverages DPO to align model responses with specific desired outputs based on a preference dataset.

Training Details

The model was trained for 1 epoch of DPO with a learning rate of 1e-05, a beta value of 0.4, and a maximum sequence length of 1024 tokens. The training dataset is u-10bei/dpo-dataset-qwen-cot.
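For intuition about the beta parameter, the standard per-example DPO loss is -log σ(β · [(log π(y_w) − log π_ref(y_w)) − (log π(y_l) − log π_ref(y_l))]), where y_w/y_l are the chosen and rejected responses. A minimal sketch with this model's beta=0.4 (the log-probabilities below are made-up illustrative values, not taken from training):

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.4):
    """Per-example DPO loss:
    -log sigmoid(beta * ((log pi_w - log ref_w) - (log pi_l - log ref_l))).
    All arguments are summed log-probabilities of complete responses.
    """
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Hypothetical log-probs: the policy prefers the chosen response more
# strongly than the reference model does, so the loss falls below log(2).
loss = dpo_loss(pi_chosen=-10.0, pi_rejected=-14.0,
                ref_chosen=-12.0, ref_rejected=-13.0)
print(round(loss, 4))  # 0.2633
```

A larger beta makes the loss more sensitive to how far the policy drifts from the reference model; at the start of training, when policy and reference agree, the margin is zero and the loss equals log(2) regardless of beta.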

Good For

  • Applications requiring improved reasoning and structured output generation.
  • Tasks where response alignment to specific preferences is crucial.
  • Developers looking for a Qwen3-based model with enhanced instruction following and structured output capabilities.