Name: trl-lib/Qwen2-0.5B-DPO API
Brand: Featherless.ai
Price: 10.00 USD
Availability: InStock
Author: trl-lib

Model Overview

trl-lib/Qwen2-0.5B-DPO is a 0.5 billion parameter language model derived from the Qwen2-0.5B-Instruct architecture. It has been specifically fine-tuned using Direct Preference Optimization (DPO), a method designed to align language models with human preferences by leveraging preference data. This training was conducted using the TRL library and the trl-lib/Capybara-Preferences dataset.

Key Capabilities

Preference Alignment: Optimized to generate responses that better align with human preferences, as learned from the Capybara-Preferences dataset.
Instruction Following: Builds upon the instruction-tuned base model, Qwen2-0.5B-Instruct, to enhance its ability to follow user instructions.
Efficient Deployment: As a 0.5 billion parameter model, it offers a balance between performance and computational efficiency.

Training Details

The model's training procedure utilized DPO, as detailed in the paper "Direct Preference Optimization: Your Language Model is Secretly a Reward Model." The training environment included TRL version 0.12.0.dev0 and Transformers version 4.45.0.dev0.

Good For

Applications requiring models that generate human-preferred responses.
Scenarios where a smaller, preference-aligned model is beneficial for deployment constraints.
Further research and experimentation with DPO fine-tuning techniques on Qwen2 architectures.