TSerizawa/llm-lecture-2025_sft-dpo-qwen-cot-merged-model

Text generation · Model size: 4B · Quantization: BF16 · Context length: 32k · Published: Feb 4, 2026 · License: apache-2.0 · Architecture: Transformer · Open weights

TSerizawa/llm-lecture-2025_sft-dpo-qwen-cot-merged-model is a 4-billion-parameter, Qwen3-based, instruction-tuned language model fine-tuned with Direct Preference Optimization (DPO) using the Unsloth library. The tuning specifically targets stronger Chain-of-Thought (CoT) reasoning and higher-quality structured responses. The model has a 40,960-token context length and ships fully merged for direct use without adapter loading, making it suitable for applications that require improved logical coherence and structured output.


Model Overview

The TSerizawa/llm-lecture-2025_sft-dpo-qwen-cot-merged-model is a 4-billion-parameter language model built on the Qwen/Qwen3-4B-Instruct-2507 base. It was fine-tuned using Direct Preference Optimization (DPO) via the Unsloth library, with the trained LoRA adapter weights merged directly into the base model. As a result, the model can be loaded with transformers immediately, without a separate adapter-loading step.
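A minimal loading-and-generation sketch with transformers, assuming a BF16-capable device; the generation settings are illustrative defaults, not recommendations from this card:

```python
# Sketch: load the merged model directly -- no PEFT/adapter step needed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "TSerizawa/llm-lecture-2025_sft-dpo-qwen-cot-merged-model"

def generate(prompt: str, max_new_tokens: int = 512) -> str:
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        torch_dtype=torch.bfloat16,  # the card lists BF16 weights
        device_map="auto",
    )
    # The Qwen3 chat template inserts role tokens and the generation prompt.
    inputs = tokenizer.apply_chat_template(
        [{"role": "user", "content": prompt}],
        add_generation_prompt=True,
        return_tensors="pt",
    ).to(model.device)
    outputs = model.generate(inputs, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens, not the echoed prompt.
    return tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True)
```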

Key Capabilities

  • Enhanced Reasoning: Optimized specifically to improve Chain-of-Thought (CoT) reasoning, leading to more logical and coherent responses.
  • Structured Output Quality: Focuses on generating higher quality structured responses, aligning with preferred output formats.
  • Direct Use: Shipped as a fully merged 16-bit (BF16) model, simplifying deployment and inference.

Training Details

The model underwent one epoch of DPO training with a learning rate of 1e-7, a beta of 0.1, and a maximum sequence length of 1024 tokens. The preference pairs used for alignment come from the u-10bei/dpo-dataset-qwen-cot dataset.
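For intuition about the beta of 0.1: the DPO objective penalizes the policy when its log-probability margin for the chosen response over the rejected one does not exceed the reference model's margin, with beta scaling how sharply that margin is rewarded. A self-contained sketch of the per-pair loss (the log-probabilities below are illustrative, not values from this model's training run):

```python
import math

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Per-pair DPO loss: -log sigmoid(beta * (policy margin - reference margin))."""
    margin = (logp_chosen - logp_rejected) - (ref_logp_chosen - ref_logp_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Policy prefers the chosen response more than the reference does -> lower loss.
better = dpo_loss(-10.0, -14.0, -11.0, -12.0)   # policy margin 4, reference margin 1
neutral = dpo_loss(-11.0, -12.0, -11.0, -12.0)  # equal margins -> loss = log(2)
```

With equal margins the loss sits at log(2) ≈ 0.693; widening the policy's preference for the chosen response beyond the reference's pushes it toward zero.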

Good For

This model is particularly well-suited for applications where improved reasoning, logical consistency, and structured output are critical. Its DPO fine-tuning makes it a strong candidate for tasks requiring nuanced understanding and generation of complex responses, especially those benefiting from Chain-of-Thought prompting.
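A small sketch of what Chain-of-Thought prompting looks like in the chat-message format this model consumes: an explicit request for intermediate steps in the user turn is a common way to elicit the CoT behavior the DPO tuning targets. The wording and the `cot_messages` helper are illustrative, not a prompt prescribed by the card:

```python
def cot_messages(question: str) -> list[dict]:
    """Build a chat-format message list that asks for step-by-step reasoning.

    The instruction wording is illustrative; any explicit request for
    intermediate reasoning steps serves the same purpose.
    """
    return [
        {
            "role": "user",
            "content": (
                f"{question}\n\n"
                "Think through the problem step by step, then state the "
                "final answer on its own line."
            ),
        }
    ]

messages = cot_messages(
    "A train travels 120 km in 1.5 hours. What is its average speed?"
)
```

The resulting list can be passed straight to the tokenizer's `apply_chat_template` for generation.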