Model Overview
ogwata/exp7-dpo-baseline is a 4-billion-parameter language model fine-tuned from Qwen/Qwen3-4B-Instruct-2507. Fine-tuning used Direct Preference Optimization (DPO) with the Unsloth library to align the model's responses with the preferred outputs in a preference dataset.
Key Capabilities & Optimization
This model's primary optimization targets include:
- Improved Reasoning: Enhanced ability to generate Chain-of-Thought (CoT) reasoning, leading to more logical and step-by-step problem-solving.
- Structured Response Quality: Optimized for producing high-quality, structured outputs based on the preference dataset used during training.
- DPO Alignment: Leverages DPO to better align model behavior with human preferences, reducing undesirable outputs and increasing helpfulness.
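Since the model is tuned to produce Chain-of-Thought reasoning, prompts that explicitly invite step-by-step thinking tend to play to its strengths. The sketch below hand-builds a prompt in the ChatML-style format used by the Qwen family; the exact template string is an assumption here, and in practice you should rely on the tokenizer's built-in chat template instead.

```python
def build_chatml_prompt(system: str, user: str) -> str:
    """Format a single-turn prompt in the ChatML style used by Qwen models.

    Note: this template is an illustrative assumption; prefer
    tokenizer.apply_chat_template() for the authoritative format.
    """
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{user}<|im_end|>\n"
        f"<|im_start|>assistant\n"  # generation continues from here
    )

prompt = build_chatml_prompt(
    "You are a helpful assistant. Think step by step before answering.",
    "A train travels 120 km in 1.5 hours. What is its average speed?",
)
```

Asking for step-by-step reasoning in the system message is one simple way to elicit the CoT behavior the DPO training reinforced.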
Training Details
The model underwent one epoch of DPO training with a learning rate of 5e-6, a beta value of 0.1, and a maximum sequence length of 1,024 tokens. Training used LoRA adapters (r=8, alpha=16), which were then merged into the base model, yielding full 16-bit weights that load without any adapter files.
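To make the role of the beta value concrete, the sketch below computes the standard per-example DPO objective, -log σ(β·((log π_chosen − log π_ref,chosen) − (log π_rejected − log π_ref,rejected))), in plain Python. The log-probability values are illustrative placeholders, not taken from this model's training run.

```python
import math

def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Per-example DPO loss: -log sigmoid(beta * (margin_chosen - margin_rejected))."""
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    return -math.log(1.0 / (1.0 + math.exp(-logits)))  # -log sigmoid(logits)

# Equal margins express no preference, giving the chance-level loss log 2 ≈ 0.693.
print(dpo_loss(-10.0, -10.0, -10.0, -10.0))
```

A small beta such as 0.1 softens the implicit reward margin, keeping the policy close to the reference model while still pushing chosen completions above rejected ones.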
Ideal Use Cases
This model is particularly well-suited for applications where:
- Reasoning tasks are critical, especially those benefiting from explicit Chain-of-Thought generation.
- Structured and aligned outputs are preferred, such as in data extraction, summarization, or controlled generation scenarios.
- A 4B-parameter model is desired for efficient deployment while still offering specialized performance in reasoning and response quality.
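Because the LoRA adapters were merged into the base model, the checkpoint loads like any standard causal LM. A minimal loading sketch, assuming the transformers library is installed; the dtype and device settings are illustrative choices, not requirements.

```python
def load_model(model_id: str = "ogwata/exp7-dpo-baseline"):
    """Load the merged 16-bit weights; no adapter files are required."""
    # Imported lazily so the sketch itself has no hard dependencies.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,  # merged weights are 16-bit
        device_map="auto",           # place layers across available devices
    )
    return tokenizer, model
```

Loading with bfloat16 keeps the 4B model's memory footprint around 8 GB, which is what makes it practical for the efficient-deployment scenarios above.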