ml-engnr/dpo-qwen-cot-merged

Text Generation · Model size: 4B · Quantization: BF16 · Context length: 32k · Published: Feb 4, 2026 · License: apache-2.0 · Architecture: Transformer · Open weights

The ml-engnr/dpo-qwen-cot-merged model is a 4-billion-parameter language model based on Qwen3-4B-Instruct-2507. It has been fine-tuned with Direct Preference Optimization (DPO) to strengthen its Chain-of-Thought (CoT) reasoning and to improve the quality of structured responses. The model is suited to tasks requiring logical deduction and well-structured output, and supports a 40,960-token context length.


Model Overview

The ml-engnr/dpo-qwen-cot-merged model is a 4-billion-parameter language model fine-tuned from the Qwen/Qwen3-4B-Instruct-2507 base model. It was trained with Direct Preference Optimization (DPO) via the Unsloth library to align its responses with preferred outputs. The model is distributed as fully merged 16-bit weights, so no adapter loading is required.

Key Capabilities

  • Enhanced Reasoning: Optimized to improve Chain-of-Thought (CoT) reasoning, making it suitable for complex problem-solving tasks.
  • Structured Response Quality: DPO training specifically targeted improving the quality and structure of generated responses based on a preference dataset.
  • Direct Use: As a merged model, it can be loaded and used directly with the transformers library without additional configuration; see the loading sketch below.
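
As a reference, here is a minimal sketch of loading the merged checkpoint with transformers and prompting it for step-by-step reasoning. The repository ID is taken from this card; the generation settings (sampling temperature, max tokens) and the example prompt are illustrative assumptions, not published recommendations.

```python
# Minimal sketch: load the merged 16-bit checkpoint directly with transformers.
# No PEFT/adapter loading is needed because the LoRA weights are already merged.
# Generation parameters below are illustrative assumptions, not official settings.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ml-engnr/dpo-qwen-cot-merged"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # weights are published in BF16
    device_map="auto",
)

# Chain-of-thought style prompt via the chat template.
messages = [
    {"role": "user", "content": "A train travels 120 km in 1.5 hours. "
                                "What is its average speed? Think step by step."}
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=512, temperature=0.7, do_sample=True)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```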

Training Details

The model underwent one epoch of DPO training with a learning rate of 5e-07 and a DPO beta of 0.4. The maximum sequence length during training was 1024 tokens. A LoRA adapter (r=8, alpha=16) was trained and subsequently merged into the base model.
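
The exact training script is not published. Purely as an illustration, a DPO run with these hyperparameters could be configured with Unsloth and TRL roughly as follows; the dataset name and LoRA target modules are placeholders and assumptions, not details from this card.

```python
# Illustrative reconstruction of the reported hyperparameters using Unsloth + TRL.
# The actual training script is not published; the dataset name and target
# modules are placeholders, not details from the model card.
from unsloth import FastLanguageModel
from datasets import load_dataset
from trl import DPOConfig, DPOTrainer

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen3-4B-Instruct-2507",
    max_seq_length=1024,          # max sequence length reported on this card
)
model = FastLanguageModel.get_peft_model(
    model,
    r=8,                          # LoRA rank reported on this card
    lora_alpha=16,                # LoRA alpha reported on this card
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],  # assumption
)

# Placeholder: any preference dataset with "prompt"/"chosen"/"rejected" columns.
train_dataset = load_dataset("your-org/your-preference-dataset", split="train")

args = DPOConfig(
    output_dir="dpo-qwen-cot",
    num_train_epochs=1,           # 1 epoch, as reported
    learning_rate=5e-7,           # as reported
    beta=0.4,                     # DPO beta, as reported
    max_length=1024,
)
trainer = DPOTrainer(model=model, args=args,
                     train_dataset=train_dataset, processing_class=tokenizer)
trainer.train()

# Merge the LoRA adapter into the base weights and save full 16-bit weights.
model.save_pretrained_merged("dpo-qwen-cot-merged", tokenizer,
                             save_method="merged_16bit")
```

Merging the adapter at save time is what lets downstream users treat the result as an ordinary transformers checkpoint, with no PEFT dependency at inference.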

Usage Considerations

This model is well suited to applications where robust reasoning and high-quality, structured outputs are critical. For compliance, users should adhere to the MIT license covering the training data as well as the license terms of the original base model.