mark-22/dpo-qwen-cot-merged

Text Generation · Concurrency Cost: 1 · Model Size: 4B · Quant: BF16 · Ctx Length: 32k · Published: Feb 6, 2026 · License: apache-2.0 · Architecture: Transformer · Open Weights

The mark-22/dpo-qwen-cot-merged model is a 4 billion parameter Qwen3-Instruct variant, fine-tuned with Direct Preference Optimization (DPO) using the Unsloth library. The fine-tuning targets stronger reasoning, particularly Chain-of-Thought (CoT), and higher-quality structured responses. It is designed for applications that need outputs aligned to preference data and coherent behavior on complex reasoning tasks.


Model Overview

The mark-22/dpo-qwen-cot-merged model is a 4 billion parameter language model derived from Qwen/Qwen3-4B-Instruct-2507. It has undergone Direct Preference Optimization (DPO) using the Unsloth library to align its responses with preferred outputs. The model ships as fully merged 16-bit weights, so no adapter loading is required.
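Because the weights are already merged, the model can be loaded directly with the transformers library. A minimal sketch, assuming standard transformers usage (only the repository id is taken from the card):

```python
# Minimal loading sketch: the LoRA weights are already merged,
# so no PEFT/adapter step is required.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mark-22/dpo-qwen-cot-merged"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # matches the published BF16 weights
    device_map="auto",           # requires the accelerate package
)
```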

Key Optimizations & Features

  • Enhanced Reasoning (Chain-of-Thought): The primary objective of the DPO fine-tuning was to significantly improve the model's reasoning abilities, especially in Chain-of-Thought (CoT) processes.
  • Improved Structured Responses: Optimization also focused on generating higher quality and more structured outputs, based on the preference dataset used during training.
  • DPO Method: Aligned with Direct Preference Optimization, trained for 1 epoch with a learning rate of 1e-7 and a beta of 0.1 (a configuration sketch follows this list).
  • Merged Weights: The LoRA adapters (r=64, alpha=64) are merged directly into the base model weights, simplifying deployment.
  • Training Sequence Length: DPO fine-tuning used a maximum sequence length of 1024 tokens; the merged model retains the 32k context length listed above.
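The exact training script is not published. The following is a hedged sketch of a comparable DPO run using Unsloth and TRL's DPOTrainer with the hyperparameters listed above; the dataset split, target modules, and output paths are assumptions.

```python
# Hedged sketch of a comparable DPO run (hyperparameters from the card;
# dataset split, target modules, and output paths are assumptions).
from unsloth import FastLanguageModel
from datasets import load_dataset
from trl import DPOConfig, DPOTrainer

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen3-4B-Instruct-2507",
    max_seq_length=1024,  # training sequence length from the card
)
model = FastLanguageModel.get_peft_model(
    model,
    r=64,           # LoRA rank from the card
    lora_alpha=64,  # LoRA alpha from the card
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],  # assumed
)

dataset = load_dataset("u-10bei/dpo-dataset-qwen-cot", split="train")  # split assumed

trainer = DPOTrainer(
    model=model,
    args=DPOConfig(
        beta=0.1,            # DPO beta from the card
        learning_rate=1e-7,  # learning rate from the card
        num_train_epochs=1,  # single epoch from the card
        output_dir="outputs",
    ),
    train_dataset=dataset,
    processing_class=tokenizer,
)
trainer.train()

# Merge the adapters into the base weights and save as full 16-bit.
model.save_pretrained_merged(
    "dpo-qwen-cot-merged", tokenizer, save_method="merged_16bit"
)
```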

Ideal Use Cases

This model is particularly well-suited for applications where:

  • Complex Reasoning is required, benefiting from its enhanced Chain-of-Thought capabilities (a usage sketch follows this list).
  • Aligned and Coherent Outputs are critical, leveraging its DPO-based preference alignment.
  • Structured Response Generation is a priority, such as in question-answering or data extraction tasks where output format matters.
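Continuing from the loading sketch above, a hedged generation example that prompts for step-by-step reasoning; the prompt wording is illustrative, not taken from the card:

```python
# Illustrative CoT-style prompt; reuses model/tokenizer from the loading sketch.
messages = [
    {
        "role": "user",
        "content": "A train travels 120 km in 1.5 hours. "
                   "What is its average speed? Think step by step.",
    }
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=512)
# Decode only the newly generated tokens.
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```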

Users should adhere to the MIT license of the training dataset (u-10bei/dpo-dataset-qwen-cot) and the license terms of the original base model.