Kyleyee/VRPO_hh-seed5

Text generation · Concurrency cost: 1 · Model size: 1.5B · Quantization: BF16 · Context length: 32k · Published: Apr 23, 2026 · Architecture: Transformer

Kyleyee/VRPO_hh-seed5 is a 1.5 billion parameter language model fine-tuned by Kyleyee from the Qwen2.5-1.5B-sft-hh-3e base model. It was trained with the DRDPO method on a helpfulness preference dataset, optimizing its ability to generate helpful, aligned responses. With a context length of 32768 tokens, it is designed for conversational AI and instruction-following tasks where helpfulness is a key requirement.


Model Overview

Kyleyee/VRPO_hh-seed5 is a 1.5 billion parameter language model developed by Kyleyee, building upon the Qwen2.5-1.5B-sft-hh-3e base model. Its primary distinction lies in its fine-tuning process, which used DRDPO, a variant of Direct Preference Optimization (DPO). DPO, introduced in the paper "Direct Preference Optimization: Your Language Model is Secretly a Reward Model" [http://papers.nips.cc/paper_files/paper/2023/hash/a85b405ed65c6477a4fe8302b5e06ce7-Abstract-Conference.html], aligns a model's outputs with human preferences by optimizing directly on pairs of preferred and dispreferred responses, using the policy itself as an implicit reward model rather than training a separate one.
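For orientation, the standard DPO objective from the paper above is shown below; this card does not specify how DRDPO modifies it, so only the base loss is given. Here \(\pi_\theta\) is the policy being trained, \(\pi_{\mathrm{ref}}\) is the frozen reference (SFT) model, \(y_w\) and \(y_l\) are the preferred and dispreferred responses to prompt \(x\), \(\beta\) controls the strength of the implicit KL constraint, and \(\sigma\) is the logistic function.

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) =
  -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}} \left[
    \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right)
  \right]
```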

Key Capabilities

  • Preference Alignment: Specifically trained on the Kyleyee/train_data_Helpful_drdpo_preference dataset to enhance helpfulness in responses.
  • Instruction Following: Optimized for generating outputs that adhere to user instructions, benefiting from the DRDPO fine-tuning.
  • Efficient Performance: At 1.5 billion parameters, it offers a balance between performance and computational efficiency for preference-aligned tasks.
  • Extended Context: Supports a context length of 32768 tokens, allowing it to process longer prompts and maintain conversational coherence (see the usage sketch after this list).
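Since the model is intended for conversational, instruction-following use, it can be tried with a plain Hugging Face transformers workflow. The following is a minimal sketch, assuming the model loads as a standard causal LM from the Hub under the ID shown on this card; the prompt is illustrative only.

```python
# Minimal inference sketch; assumes Kyleyee/VRPO_hh-seed5 is available on the
# Hugging Face Hub as a standard causal language model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Kyleyee/VRPO_hh-seed5"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# BF16 matches the quantization listed in this card's metadata.
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

prompt = "Explain, in two sentences, how to politely decline a meeting."  # hypothetical prompt
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```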

Training Details

The model was fine-tuned using the TRL (Transformer Reinforcement Learning) library [https://github.com/huggingface/trl], leveraging the DRDPO algorithm. Like standard DPO, this approach directly optimizes the policy so that preferred responses become more likely than dispreferred ones relative to the reference model, making it particularly suitable for applications requiring robust and helpful conversational agents.
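TRL does not ship a trainer named DRDPO, so the sketch below uses TRL's documented DPOTrainer as the closest analogue. The base-model Hub ID, the hyperparameters (beta, batch size), and the dataset column layout are assumptions rather than published details of this training run; the preference dataset name is taken from this card.

```python
# DPO-style preference fine-tuning sketch with TRL. DRDPO itself is not a
# TRL class; DPOTrainer is used here as the closest documented analogue.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base = "Kyleyee/Qwen2.5-1.5B-sft-hh-3e"  # assumed Hub ID for the SFT base model

model = AutoModelForCausalLM.from_pretrained(base)
tokenizer = AutoTokenizer.from_pretrained(base)

# TRL's DPO data convention expects "prompt", "chosen", and "rejected" columns;
# this assumes the dataset follows that layout.
dataset = load_dataset("Kyleyee/train_data_Helpful_drdpo_preference", split="train")

config = DPOConfig(
    output_dir="vrpo_hh-seed5",     # hypothetical output path
    beta=0.1,                       # KL strength; the actual value is not published
    seed=5,                         # matches the "seed5" suffix in the model name
    per_device_train_batch_size=2,  # hypothetical; tune to available memory
)

trainer = DPOTrainer(
    model=model,                 # ref_model omitted: TRL clones the policy as reference
    args=config,
    train_dataset=dataset,
    processing_class=tokenizer,  # recent TRL versions; older ones use tokenizer=
)
trainer.train()
```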