Name: helloworldabc/dpo-qwen-cot-merged API
Brand: Featherless.ai
Price: 10.00 USD
Availability: InStock
Author: helloworldabc

Model Overview

This model, helloworldabc/dpo-qwen-cot-merged, is a 4 billion parameter language model derived from the Qwen/Qwen3-4B-Instruct-2507 base model. It has undergone Direct Preference Optimization (DPO) using the Unsloth library, resulting in a fully merged 16-bit weight model that requires no adapter loading.

Key Capabilities

Enhanced Reasoning: Optimized to improve Chain-of-Thought (CoT) reasoning, making it suitable for tasks requiring logical deduction and multi-step problem-solving.
Structured Response Quality: Fine-tuned to produce more coherent and structured outputs, aligning with preferred response formats.
DPO Alignment: Utilizes DPO to align model responses with human preferences, based on a specific preference dataset (u-10bei/dpo-dataset-qwen-cot).

Training Details

The model was trained for 1 epoch with a learning rate of 1e-07 and a beta value of 0.1. It supports a maximum sequence length of 1024 tokens during training. The LoRA configuration (r=8, alpha=16) was merged directly into the base model.

Usage

As a merged model, it can be directly integrated and used with the transformers library for inference, offering straightforward deployment for various applications.

Overview

Model Overview

Key Capabilities

Training Details

Usage

Full Model Card (README)