Kyleyee/VRPO_hh-seed1

Text generation · Concurrency cost: 1 · Model size: 1.5B · Quantization: BF16 · Context length: 32k · Published: Apr 23, 2026 · Architecture: Transformer

Kyleyee/VRPO_hh-seed1 is a 1.5 billion parameter language model fine-tuned from Kyleyee/Qwen2.5-1.5B-sft-hh-3e. It was trained using the DRDPO method on the Kyleyee/train_data_Helpful_drdpo_preference dataset, specializing it for helpfulness. With a 32768-token context length, this model is optimized for generating helpful and preference-aligned text responses.


Model Overview

Kyleyee/VRPO_hh-seed1 is a 1.5 billion parameter language model built upon the Kyleyee/Qwen2.5-1.5B-sft-hh-3e base model. It has been fine-tuned using the DRDPO method, which builds on Direct Preference Optimization (DPO) as introduced in the paper "Direct Preference Optimization: Your Language Model is Secretly a Reward Model." Training used the Kyleyee/train_data_Helpful_drdpo_preference dataset, aligning the model's outputs with human preferences for helpfulness.

Key Capabilities

  • Preference Alignment: Optimized to generate responses that are aligned with helpfulness preferences through DRDPO training.
  • Context Handling: Supports a substantial context length of 32768 tokens, allowing for processing and generating longer, more coherent texts.
  • Foundation Model: Serves as a fine-tuned version of a Qwen2.5-based model, inheriting its underlying architectural strengths.
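The capabilities above can be exercised with a standard text-generation call. The sketch below is illustrative, not taken from the repository: it assumes the checkpoint loads with the usual `transformers` causal-LM classes, and that prompts follow the Anthropic HH dialogue format (`\n\nHuman: ... \n\nAssistant:`) commonly used with hh preference data; verify both against the original repository before relying on them.

```python
MODEL_ID = "Kyleyee/VRPO_hh-seed1"


def format_hh_prompt(user_message: str) -> str:
    """Wrap a single user turn in the HH dialogue format.

    The exact prompt template is an assumption; check the repo's
    tokenizer/chat template for the authoritative format.
    """
    return f"\n\nHuman: {user_message}\n\nAssistant:"


def generate(user_message: str, max_new_tokens: int = 256) -> str:
    # transformers is imported lazily so the prompt helper above
    # stays dependency-free.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    # BF16 matches the quantization listed on the card.
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="bfloat16")

    inputs = tokenizer(format_hh_prompt(user_message), return_tensors="pt")
    output = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        temperature=0.7,
    )
    # Decode only the newly generated tokens, not the prompt.
    return tokenizer.decode(
        output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
```

For multi-turn use, append each completed exchange to the prompt before the final `Assistant:` marker.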

Training Details

This model was trained using the TRL (Transformer Reinforcement Learning) framework. The DRDPO method is the key differentiator: like DPO, it optimizes the language model directly on preference data, exploiting the insight that the LM implicitly parameterizes a reward model, so no separate reward model needs to be trained. The training process can be inspected via the Weights & Biases logs linked in the original repository.
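As a rough illustration of what such a TRL preference-training run looks like, the sketch below uses TRL's standard `DPOTrainer`; it assumes DRDPO follows the same training interface, and the hyperparameter values shown are illustrative, not the ones actually used for this model.

```python
def train() -> None:
    """Minimal DPO-style training sketch with TRL.

    All hyperparameters here are placeholders; the card does not
    disclose the actual DRDPO settings.
    """
    from datasets import load_dataset
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from trl import DPOConfig, DPOTrainer

    base = "Kyleyee/Qwen2.5-1.5B-sft-hh-3e"  # SFT base named on the card
    model = AutoModelForCausalLM.from_pretrained(base)
    tokenizer = AutoTokenizer.from_pretrained(base)

    # Preference dataset named on the card (chosen/rejected pairs).
    dataset = load_dataset(
        "Kyleyee/train_data_Helpful_drdpo_preference", split="train"
    )

    config = DPOConfig(
        output_dir="vrpo-hh",
        beta=0.1,                      # KL-penalty strength; illustrative
        per_device_train_batch_size=2,
        max_length=1024,
    )
    trainer = DPOTrainer(
        model=model,
        args=config,
        train_dataset=dataset,
        processing_class=tokenizer,
    )
    trainer.train()
```

When no explicit reference model is passed, `DPOTrainer` creates a frozen copy of the policy to serve as the DPO reference.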