Yale-ROSE/Qwen3-4B-dpo_gpt-oss-120b_8k_reasoning_ablation
Yale-ROSE/Qwen3-4B-dpo_gpt-oss-120b_8k_reasoning_ablation is a 4-billion-parameter language model fine-tuned from Qwen/Qwen3-4B by Yale-ROSE. It was trained with Direct Preference Optimization (DPO) to better align its outputs with human preferences. With a 32768-token context length, it is intended for general text generation tasks that benefit from DPO-style alignment.
Overview
This model, developed by Yale-ROSE, is a fine-tuned variant of the Qwen3-4B architecture, with 4 billion parameters and a 32768-token context window. It was trained using Direct Preference Optimization (DPO), the method introduced in the paper "Direct Preference Optimization: Your Language Model is Secretly a Reward Model," with training carried out in the TRL framework to align model outputs with human preferences.
Key Capabilities
- Preference Alignment: DPO training optimizes the model directly on preference pairs, which should make its outputs track the preferences expressed in the training data more closely.
- General Text Generation: Capable of responding to a wide range of prompts, as illustrated in the usage sketch after this list.
- Extended Context: A 32768-token context length allows the model to process and generate longer, more coherent texts.
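The card does not ship a quick start snippet, so the following is a minimal generation sketch. It assumes the checkpoint loads with the standard transformers AutoModelForCausalLM / AutoTokenizer classes and chat template, as the base Qwen/Qwen3-4B does; the prompt and sampling settings are illustrative only.

```python
# Minimal generation sketch; assumes this checkpoint behaves like the base
# Qwen/Qwen3-4B under the standard transformers Auto classes.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Yale-ROSE/Qwen3-4B-dpo_gpt-oss-120b_8k_reasoning_ablation"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

messages = [
    {"role": "user", "content": "Explain Direct Preference Optimization in two sentences."}
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

# Sampling settings here are placeholders, not tuned recommendations.
outputs = model.generate(inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```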
Training Details
The model was trained with TRL 0.23.0, Transformers 4.56.1, PyTorch 2.7.1, Datasets 3.6.0, and Tokenizers 0.22.0. DPO optimizes the policy directly on preference pairs by exploiting the fact that the language model implicitly parameterizes a reward model, avoiding the separate reward-model training and reinforcement learning steps of traditional RLHF pipelines.
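The card does not publish the exact training configuration. The sketch below shows how a comparable DPO run can be set up with TRL's DPOTrainer; the dataset name, hyperparameters, and output path are illustrative assumptions, not the values used by Yale-ROSE.

```python
# Illustrative DPO setup with TRL's DPOTrainer; the dataset and all
# hyperparameters are placeholders, not this model's actual configuration.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base_id = "Qwen/Qwen3-4B"
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id)

# A preference dataset with "prompt", "chosen", and "rejected" columns.
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

config = DPOConfig(
    output_dir="qwen3-4b-dpo",
    beta=0.1,                      # strength of the implicit KL penalty toward the reference model
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    num_train_epochs=1,
)

trainer = DPOTrainer(
    model=model,
    args=config,
    train_dataset=dataset,
    processing_class=tokenizer,
)
trainer.train()
```

When no explicit reference model is passed, DPOTrainer creates a frozen copy of the policy to serve as the reference, which is the usual default.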
Good For
- Applications requiring models that generate text aligned with specific preferences.
- General conversational AI and text completion tasks where context understanding is crucial.
- Developers interested in exploring the effects of DPO on Qwen3-4B's performance.