Kyleyee/VRPO_hh-seed3
Kyleyee/VRPO_hh-seed3 is a 1.5 billion parameter causal language model fine-tuned by Kyleyee. It is based on Kyleyee/Qwen2.5-1.5B-sft-hh-3e and optimized with DRDPO, a variant of Direct Preference Optimization (DPO), on a helpfulness preference dataset. The model is designed to generate helpful, aligned responses, using preference-based training to improve conversational quality.
Model Overview
Kyleyee/VRPO_hh-seed3 is a 1.5 billion parameter language model developed by Kyleyee, fine-tuned from the Kyleyee/Qwen2.5-1.5B-sft-hh-3e base model, with a context length of 32,768 tokens. Its main distinction is its training methodology: DRDPO, a variant of Direct Preference Optimization (DPO), applied to a dedicated helpfulness preference dataset (Kyleyee/train_data_Helpful_drdpo_preference). This approach aims to align the model's outputs with human preferences for helpfulness.
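A minimal quick-start sketch using the transformers library, assuming the checkpoint loads like any other Qwen2.5-family causal LM on the Hub (the model ID comes from this card; the prompt and generation settings are illustrative):

```python
# Minimal sketch: load the model with the standard transformers API.
# Assumes the checkpoint follows the usual Qwen2.5 layout on the Hub.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Kyleyee/VRPO_hh-seed3"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",   # pick the checkpoint's native precision
    device_map="auto",    # requires the accelerate package
)

prompt = "How can I politely decline a meeting invitation?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```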
Key Capabilities
- Preference-aligned generation: Optimized to produce responses that are perceived as more helpful based on direct preference feedback.
- Conversational AI: Suitable for multi-turn tasks requiring engaging and helpful dialogue (see the chat-template sketch after this list).
- Efficient inference: With 1.5 billion parameters, it offers a balance between performance and computational efficiency.
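For conversational use, a sketch of multi-turn prompting via the tokenizer's chat template. This assumes the tokenizer inherits a chat template from its Qwen2.5 base, which is worth verifying on the Hub; the message content is illustrative:

```python
# Sketch: multi-turn generation through the tokenizer's chat template
# (assumes a Qwen2.5-style chat template ships with the tokenizer).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Kyleyee/VRPO_hh-seed3"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [
    {"role": "user", "content": "Can you suggest a simple weekly meal plan?"},
]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,  # append the assistant turn marker
    return_tensors="pt",
).to(model.device)

outputs = model.generate(input_ids, max_new_tokens=256)
# Decode only the newly generated tokens, not the prompt.
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```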
Training Details
The model was trained with the TRL framework using the DRDPO method. DRDPO builds on DPO (Direct Preference Optimization), introduced in the paper "Direct Preference Optimization: Your Language Model is Secretly a Reward Model" (arXiv:2305.18290), which optimizes a language model directly on human preference data without training a separate reward model. This training paradigm steers the model toward responses that human raters prefer.
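A hedged sketch of what the preference-optimization stage could look like with TRL's stock DPOTrainer. The dataset name and base model come from this card, but the hyperparameters are illustrative, and any DRDPO-specific modifications are not part of stock TRL and are not shown here:

```python
# Sketch: standard DPO training with TRL on the card's preference dataset.
# DRDPO-specific changes (if any) would live in custom code on top of this.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base_id = "Kyleyee/Qwen2.5-1.5B-sft-hh-3e"
model = AutoModelForCausalLM.from_pretrained(base_id)
tokenizer = AutoTokenizer.from_pretrained(base_id)

# Preference dataset named in this card; assumed to follow TRL's
# preference format with "prompt"/"chosen"/"rejected" columns.
dataset = load_dataset("Kyleyee/train_data_Helpful_drdpo_preference", split="train")

training_args = DPOConfig(
    output_dir="VRPO_hh-seed3",
    beta=0.1,                        # illustrative KL-penalty strength, not the card's value
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
)

trainer = DPOTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    processing_class=tokenizer,  # older TRL versions use tokenizer= instead
)
trainer.train()
```

With no explicit ref_model, DPOTrainer clones the policy as the frozen reference, which matches the common setup of running DPO directly on the SFT checkpoint.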