Kyleyee/VRPO_hh-seed1
Kyleyee/VRPO_hh-seed1 is a 1.5 billion parameter language model fine-tuned from Kyleyee/Qwen2.5-1.5B-sft-hh-3e. It was trained with the DRDPO method on the Kyleyee/train_data_Helpful_drdpo_preference dataset to specialize it for helpfulness. With a 32768-token context length, the model is intended for generating helpful, preference-aligned responses.
Model Overview
Kyleyee/VRPO_hh-seed1 is a 1.5 billion parameter language model built on the Kyleyee/Qwen2.5-1.5B-sft-hh-3e base model. It was fine-tuned with DRDPO, a variant of DPO (Direct Preference Optimization), the method introduced in the paper "Direct Preference Optimization: Your Language Model is Secretly a Reward Model." Training used the Kyleyee/train_data_Helpful_drdpo_preference dataset to align the model's outputs with human preferences for helpfulness.
Key Capabilities
- Preference Alignment: Optimized to generate responses that are aligned with helpfulness preferences through DRDPO training.
- Context Handling: Supports a 32768-token context window, allowing it to process long prompts and generate extended, coherent responses (see the usage sketch after this list).
- Foundation Model: Serves as a fine-tuned version of a Qwen2.5-based model, inheriting its underlying architectural strengths.
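
A minimal inference sketch with the transformers library is shown below. The chat template and sampling settings are assumptions based on the Qwen2.5 family; verify them against the tokenizer files shipped with this checkpoint.

```python
# Minimal inference sketch (assumes the standard Qwen2.5 chat template;
# check tokenizer_config.json in the repository for the exact template).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Kyleyee/VRPO_hh-seed1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # assumption: bf16 is suitable for your hardware
    device_map="auto",
)

messages = [
    {"role": "user", "content": "How do I safely defrost chicken?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

# The 32768-token context covers prompt plus generated tokens combined.
outputs = model.generate(inputs, max_new_tokens=512, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```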
Training Details
This model was trained using the TRL (Transformer Reinforcement Learning) framework. DPO-style methods such as DRDPO optimize the language model directly on preference pairs, with no separate reward model: the policy's log-probability ratio against a frozen reference model plays the role of the reward, which is the sense in which the LM is "secretly a reward model." Training curves are available as Weights & Biases logs, linked in the original repository.
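
DRDPO itself is not part of stock TRL, so the sketch below shows the nearest stock analogue: a standard DPOTrainer run on the same preference dataset. The hyperparameters (beta, learning rate, batch size) are placeholders, not the values used to train this model.

```python
# Hedged sketch of a DPO-style run with TRL; DRDPO is not in stock TRL,
# so this uses the standard DPOTrainer as the closest available analogue.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base_id = "Kyleyee/Qwen2.5-1.5B-sft-hh-3e"
model = AutoModelForCausalLM.from_pretrained(base_id)
tokenizer = AutoTokenizer.from_pretrained(base_id)

# Dataset is expected in TRL's preference format:
# "prompt" / "chosen" / "rejected" columns.
dataset = load_dataset("Kyleyee/train_data_Helpful_drdpo_preference", split="train")

args = DPOConfig(
    output_dir="VRPO_hh-seed1",
    beta=0.1,                        # placeholder KL-penalty strength
    per_device_train_batch_size=2,   # placeholder
    learning_rate=5e-7,              # placeholder
    report_to="wandb",               # matches the W&B logging mentioned above
)

trainer = DPOTrainer(
    model=model,                 # the reference model defaults to a frozen copy
    args=args,
    train_dataset=dataset,
    processing_class=tokenizer,  # recent TRL releases; older ones use tokenizer=
)
trainer.train()
```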