Yale-ROSE/Qwen3-4B-dpo_gpt-oss-120b_8k_reasoning_ablation
Yale-ROSE/Qwen3-4B-dpo_gpt-oss-120b_8k_reasoning_ablation is a 4-billion-parameter language model fine-tuned from Qwen/Qwen3-4B by Yale-ROSE. It was trained with Direct Preference Optimization (DPO) to better align its outputs with human preferences. With a 32768-token context length, it is intended for general text generation tasks that benefit from DPO-style alignment.
Overview
This model, developed by Yale-ROSE, is a fine-tuned variant of the Qwen3-4B architecture, with 4 billion parameters and a 32768-token context window. It was trained using Direct Preference Optimization (DPO), the method introduced in the paper "Direct Preference Optimization: Your Language Model is Secretly a Reward Model," with training carried out in the TRL framework to align model outputs with human preferences.
Key Capabilities
- Preference Alignment: DPO training optimizes the model directly on preference pairs, which should make its outputs track the preferences expressed in the training data more closely.
- General Text Generation: Capable of responding to a wide range of prompts, as illustrated in the usage sketch after this list.
- Extended Context: A 32768-token context length allows the model to process and generate longer, more coherent texts.
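The card does not ship a quick start snippet, so the following is a minimal generation sketch. It assumes the checkpoint loads with the standard transformers AutoModelForCausalLM / AutoTokenizer classes and chat template, as the base Qwen/Qwen3-4B does; the prompt and sampling settings are illustrative only.

```python
# Minimal generation sketch; assumes this checkpoint behaves like the base
# Qwen/Qwen3-4B under the standard transformers Auto classes.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Yale-ROSE/Qwen3-4B-dpo_gpt-oss-120b_8k_reasoning_ablation"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

messages = [
    {"role": "user", "content": "Explain Direct Preference Optimization in two sentences."}
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

# Sampling settings here are placeholders, not tuned recommendations.
outputs = model.generate(inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```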
Training Details
The model was trained with TRL 0.23.0, Transformers 4.56.1, PyTorch 2.7.1, Datasets 3.6.0, and Tokenizers 0.22.0. DPO optimizes the policy directly on preference pairs by exploiting the fact that the language model implicitly parameterizes a reward model, avoiding the separate reward-model training and reinforcement learning steps of traditional RLHF pipelines.
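The card does not publish the exact training configuration. The sketch below shows how a comparable DPO run can be set up with TRL's DPOTrainer; the dataset name, hyperparameters, and output path are illustrative assumptions, not the values used by Yale-ROSE.

```python
# Illustrative DPO setup with TRL's DPOTrainer; the dataset and all
# hyperparameters are placeholders, not this model's actual configuration.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base_id = "Qwen/Qwen3-4B"
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id)

# A preference dataset with "prompt", "chosen", and "rejected" columns.
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

config = DPOConfig(
    output_dir="qwen3-4b-dpo",
    beta=0.1,                      # strength of the implicit KL penalty toward the reference model
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    num_train_epochs=1,
)

trainer = DPOTrainer(
    model=model,
    args=config,
    train_dataset=dataset,
    processing_class=tokenizer,
)
trainer.train()
```

When no explicit reference model is passed, DPOTrainer creates a frozen copy of the policy to serve as the reference, which is the usual default.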
Good For
- Applications requiring models that generate text aligned with specific preferences.
- General conversational AI and text completion tasks where context understanding is crucial.
- Developers interested in exploring the effects of DPO on Qwen3-4B's performance.