HuiyuWang/dpo-qwen-cot-merged
Hosted on Hugging Face · Text generation · Model size: 4B · Quantization: BF16 · Context length: 32k · Published: Mar 1, 2026 · License: apache-2.0 · Architecture: Transformer · Open weights

HuiyuWang/dpo-qwen-cot-merged is a 4-billion-parameter Qwen3-based causal language model developed by HuiyuWang, fine-tuned through a multi-stage process of Supervised Fine-Tuning (SFT) followed by Direct Preference Optimization (DPO). The model specializes in structured transformation tasks and Chain-of-Thought (CoT) reasoning, using preference alignment to improve generation quality. It is intended for academic research, competition submissions, and applications that require robust structured data processing.


Model Overview

HuiyuWang/dpo-qwen-cot-merged is a 4-billion-parameter language model built on the Qwen3-4B-Instruct-2507 base. It has undergone a multi-stage fine-tuning process to strengthen its capabilities, particularly in structured data transformation and Chain-of-Thought (CoT) reasoning. The training pipeline began with an initial Supervised Fine-Tuning (SFT) stage, continued with a refinement stage on "hard" structured data, and concluded with Direct Preference Optimization (DPO) to align the model with preferred CoT reasoning patterns.
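The DPO stage trains the model to prefer a "chosen" completion over a "rejected" one relative to a frozen reference model. As a minimal sketch of the per-pair objective (an illustrative stand-alone implementation, not code from this repository), the loss takes the summed log-probabilities of each completion under the policy and the reference model:

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Per-pair DPO loss: -log sigmoid(beta * margin), where the margin is
    the policy-vs-reference log-ratio of the chosen completion minus that
    of the rejected one. All inputs are summed token log-probabilities."""
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# When the policy equals the reference, the margin is 0 and the loss is ln(2);
# the loss shrinks as the policy favors the chosen completion more strongly.
```

The `beta` temperature controls how sharply the policy is pushed away from the reference; its value here is illustrative.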

Key Capabilities

  • Multi-stage Fine-Tuning: Combines SFT, hard data refinement, and DPO for robust performance.
  • Chain-of-Thought (CoT) Reasoning: Specifically aligned to generate step-by-step reasoning, with loss applied only to final outputs during SFT.
  • Structured Transformation: Enhanced for tasks involving the manipulation and transformation of structured data.
  • Preference Alignment: Utilizes DPO with (prompt, chosen, rejected) data to guide model behavior towards desired outputs.
  • Memory-Efficient Training: Fine-tuned using QLoRA and Unsloth for efficient 4-bit training.

Intended Use Cases

This model is particularly well-suited for:

  • Structured transformation tasks: Processing and converting structured data formats.
  • Chain-of-Thought reasoning: Generating detailed, step-by-step solutions to complex problems.
  • Preference-aligned generation: Producing outputs that adhere to specific desired patterns or styles.
  • Academic research experiments: Exploring multi-stage fine-tuning and preference learning techniques.
  • Competition submissions: As a robust foundation for AI challenges requiring reasoning and structured output.
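For the structured-transformation use case, a caller would typically serialize the input record into the prompt and request step-by-step reasoning before the final output. A minimal, hypothetical prompt-building sketch (the helper name and instruction wording are illustrative, not from the model's documentation):

```python
import json

def build_transform_prompt(record, target_format="CSV"):
    """Wrap a structured record in a CoT-style instruction asking the
    model to convert it to `target_format`, reasoning step by step."""
    return (
        f"Transform the following JSON record into {target_format}. "
        "Think step by step, then give the final answer on the last line.\n\n"
        f"```json\n{json.dumps(record, indent=2)}\n```"
    )
```

The resulting string would then be passed to the model through its chat template as the user turn.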