Kyleyee/VRPO_hh-seed1
Kyleyee/VRPO_hh-seed1 is a 1.5 billion parameter language model fine-tuned from Kyleyee/Qwen2.5-1.5B-sft-hh-3e. It was trained with the DRDPO method on the Kyleyee/train_data_Helpful_drdpo_preference dataset to specialize it for helpfulness. With a 32768-token context length, the model is intended for generating helpful, preference-aligned responses.
Model Overview
Kyleyee/VRPO_hh-seed1 is a 1.5 billion parameter language model built on the Kyleyee/Qwen2.5-1.5B-sft-hh-3e base model. It was fine-tuned with DRDPO, a variant of DPO (Direct Preference Optimization), the method introduced in the paper "Direct Preference Optimization: Your Language Model is Secretly a Reward Model." Training used the Kyleyee/train_data_Helpful_drdpo_preference dataset to align the model's outputs with human preferences for helpfulness.
Key Capabilities
- Preference Alignment: Optimized to generate responses that are aligned with helpfulness preferences through DRDPO training.
- Context Handling: Supports a 32768-token context window, allowing it to process long prompts and generate extended, coherent responses (see the usage sketch after this list).
- Foundation Model: Serves as a fine-tuned version of a Qwen2.5-based model, inheriting its underlying architectural strengths.
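
A minimal inference sketch with the transformers library is shown below. The chat template and sampling settings are assumptions based on the Qwen2.5 family; verify them against the tokenizer files shipped with this checkpoint.

```python
# Minimal inference sketch (assumes the standard Qwen2.5 chat template;
# check tokenizer_config.json in the repository for the exact template).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Kyleyee/VRPO_hh-seed1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # assumption: bf16 is suitable for your hardware
    device_map="auto",
)

messages = [
    {"role": "user", "content": "How do I safely defrost chicken?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

# The 32768-token context covers prompt plus generated tokens combined.
outputs = model.generate(inputs, max_new_tokens=512, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```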
Training Details
This model was trained using the TRL (Transformer Reinforcement Learning) framework. DPO-style methods such as DRDPO optimize the language model directly on preference pairs, with no separate reward model: the policy's log-probability ratio against a frozen reference model plays the role of the reward, which is the sense in which the LM is "secretly a reward model." Training curves are available as Weights & Biases logs, linked in the original repository.
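
DRDPO itself is not part of stock TRL, so the sketch below shows the nearest stock analogue: a standard DPOTrainer run on the same preference dataset. The hyperparameters (beta, learning rate, batch size) are placeholders, not the values used to train this model.

```python
# Hedged sketch of a DPO-style run with TRL; DRDPO is not in stock TRL,
# so this uses the standard DPOTrainer as the closest available analogue.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base_id = "Kyleyee/Qwen2.5-1.5B-sft-hh-3e"
model = AutoModelForCausalLM.from_pretrained(base_id)
tokenizer = AutoTokenizer.from_pretrained(base_id)

# Dataset is expected in TRL's preference format:
# "prompt" / "chosen" / "rejected" columns.
dataset = load_dataset("Kyleyee/train_data_Helpful_drdpo_preference", split="train")

args = DPOConfig(
    output_dir="VRPO_hh-seed1",
    beta=0.1,                        # placeholder KL-penalty strength
    per_device_train_batch_size=2,   # placeholder
    learning_rate=5e-7,              # placeholder
    report_to="wandb",               # matches the W&B logging mentioned above
)

trainer = DPOTrainer(
    model=model,                 # the reference model defaults to a frozen copy
    args=args,
    train_dataset=dataset,
    processing_class=tokenizer,  # recent TRL releases; older ones use tokenizer=
)
trainer.train()
```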