Model Overview
This model, keijiban3/dpo-qwen-cot-merged, is a fine-tuned version of the Qwen/Qwen3-4B-Instruct-2507 base model, with approximately 4 billion parameters and a context length of 32,768 tokens. It was optimized using Direct Preference Optimization (DPO) via the Unsloth library, and its 16-bit weights have been fully merged, so it can be used directly without loading adapters.
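Because the adapters are already merged, the model loads like any standard Transformers checkpoint. The following is a minimal inference sketch using the Hugging Face `transformers` library; the prompt and generation parameters are illustrative, not part of the released configuration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "keijiban3/dpo-qwen-cot-merged"

# Merged 16-bit weights: no PEFT/adapter loading step is required.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # matches the merged 16-bit weights
    device_map="auto",           # requires the `accelerate` package
)

messages = [
    {"role": "user", "content": "If a train covers 180 km in 2 hours and then "
                                "120 km in 1 hour, what is its average speed? "
                                "Think step by step."}
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```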
Key Capabilities
- Enhanced Reasoning: Specifically trained to improve Chain-of-Thought (CoT) reasoning, making it suitable for tasks requiring multi-step logical deduction.
- Structured Response Quality: Optimized to align responses with preferred outputs, leading to more coherent and structured generations.
- DPO Fine-tuning: Leverages DPO to refine model behavior based on preference datasets, aiming for higher-quality, better-aligned outputs (see the preference-pair example after this list).
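DPO learns from pairs of responses where one is preferred over the other. Below is a hypothetical preference record in the prompt/chosen/rejected format used by TRL's DPOTrainer; the field names follow the TRL convention, and the contents are invented for illustration since the actual training data is not published here.

```python
# One preference record: the "chosen" response shows step-by-step reasoning,
# while the "rejected" response is terse and incorrect.
preference_example = {
    "prompt": "A shop sells pens at 3 for $2. How much do 12 pens cost? "
              "Think step by step.",
    "chosen": (
        "12 pens is 12 / 3 = 4 groups of three. "
        "Each group costs $2, so 4 * 2 = $8. The answer is $8."
    ),
    "rejected": "12 pens cost $6.",
}
```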
Training Details
The model was trained for 1 epoch of DPO with a learning rate of 1e-7 and a DPO beta of 0.1, using a maximum sequence length of 1024 tokens. Training used LoRA adapters (r=8, alpha=16), which were then merged into the base model.
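A minimal sketch of how such a run could be set up with Unsloth and TRL, assuming a preference dataset in the prompt/chosen/rejected format shown above. Only the hyperparameters named in this card (1 epoch, lr 1e-7, beta 0.1, max length 1024, r=8, alpha=16) come from the card; the dataset file, target modules, and output paths are assumptions, and exact trainer arguments vary by TRL version.

```python
from unsloth import FastLanguageModel
from trl import DPOConfig, DPOTrainer
from datasets import load_dataset

# Load the base model with Unsloth; max_seq_length matches the card.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen3-4B-Instruct-2507",
    max_seq_length=1024,
)

# Attach LoRA adapters with the configuration stated above (r=8, alpha=16).
# This target_modules list is a common default, not confirmed by this card.
model = FastLanguageModel.get_peft_model(
    model,
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# Hypothetical preference dataset with prompt/chosen/rejected columns.
dataset = load_dataset("json", data_files="preferences.jsonl", split="train")

trainer = DPOTrainer(
    model=model,
    args=DPOConfig(
        beta=0.1,            # DPO beta from the card
        learning_rate=1e-7,  # learning rate from the card
        num_train_epochs=1,  # 1 epoch from the card
        max_length=1024,
        output_dir="dpo-qwen-cot",
    ),
    train_dataset=dataset,
    processing_class=tokenizer,
)
trainer.train()

# Merge the LoRA weights into the base model and save full 16-bit weights
# (Unsloth helper; this produces a checkpoint usable without adapter loading).
model.save_pretrained_merged("dpo-qwen-cot-merged", tokenizer,
                             save_method="merged_16bit")
```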
Good For
- Applications requiring improved logical reasoning and step-by-step explanations.
- Generating structured outputs that adhere to specific formats or preferences.
- Tasks where response quality and alignment with human preferences are critical.