Kyleyee/rDPO_hh-seed3

Text Generation · Model Size: 1.5B · Quantization: BF16 · Context Length: 32k · Published: Apr 28, 2026 · Architecture: Transformer

Kyleyee/rDPO_hh-seed3 is a 1.5-billion-parameter language model fine-tuned by Kyleyee. It is based on Kyleyee/Qwen2.5-1.5B-sft-hh-3e and optimized with Direct Preference Optimization (DPO) on the Kyleyee/train_data_Helpful_drdpo_preference dataset, with the aim of producing helpful, preference-aligned responses. A 32,768-token context length makes it suitable for tasks requiring longer interactions.


Model Overview

Kyleyee/rDPO_hh-seed3 is a 1.5-billion-parameter language model developed by Kyleyee, building on the base Kyleyee/Qwen2.5-1.5B-sft-hh-3e. It has been fine-tuned with Direct Preference Optimization (DPO), a method that aligns language models with human preferences by using the policy itself as an implicit reward model, so no separately trained reward model is required.

Key Capabilities

  • Preference-Aligned Responses: Optimized with DPO on a helpfulness preference dataset, steering generations toward the responses annotators preferred (see the usage sketch after this list).
  • Efficient Fine-tuning: Trained with the TRL library, demonstrating that modern alignment techniques can be applied to smaller (1.5B-parameter) models.
  • Extended Context Window: A 32,768-token context length supports long multi-turn conversations and document-scale inputs.
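
The model can be loaded through the standard Hugging Face transformers API. The snippet below is a minimal sketch, assuming the checkpoint inherits the Qwen2.5 chat template from its SFT base; the prompt and generation settings are illustrative, not recommended defaults.

```python
# Minimal inference sketch; assumes the tokenizer ships a chat template
# inherited from the Qwen2.5 base model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Kyleyee/rDPO_hh-seed3"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # the card lists BF16 weights
    device_map="auto",
)

messages = [{"role": "user", "content": "How do I politely decline a meeting?"}]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```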

Training Details

The model was trained on the Kyleyee/train_data_Helpful_drdpo_preference dataset. The DPO method, introduced in "Direct Preference Optimization: Your Language Model is Secretly a Reward Model," was used to improve the model's ability to produce helpful outputs. DPO learns a reward function implicitly from preference pairs, guiding generation toward preferred responses without an explicit reward-modeling stage.
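
Concretely, the DPO objective from that paper folds the reward into the policy. For a prompt $x$ with preferred response $y_w$ and dispreferred response $y_l$ drawn from the preference dataset $\mathcal{D}$, it minimizes:

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]
$$

Here $\pi_{\mathrm{ref}}$ is the frozen SFT model (the Qwen2.5-1.5B SFT base in this case), $\sigma$ is the logistic function, and $\beta$ controls how far the fine-tuned policy may drift from the reference.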
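In TRL, this objective is implemented by DPOTrainer. The following is a hedged sketch of how such a run could be set up, assuming a recent TRL release (where the tokenizer is passed as processing_class); the hyperparameters shown are illustrative placeholders, not the card's actual training configuration.

```python
# Illustrative DPO fine-tuning setup with TRL; hyperparameters are
# placeholders, not the values used to produce this checkpoint.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base_id = "Kyleyee/Qwen2.5-1.5B-sft-hh-3e"  # SFT base named in the card

# Assumes a 'train' split with prompt/chosen/rejected preference columns.
dataset = load_dataset("Kyleyee/train_data_Helpful_drdpo_preference", split="train")

model = AutoModelForCausalLM.from_pretrained(base_id)
tokenizer = AutoTokenizer.from_pretrained(base_id)

config = DPOConfig(
    output_dir="rDPO_hh-seed3",
    beta=0.1,                      # strength of the KL constraint; illustrative
    per_device_train_batch_size=4,
    learning_rate=5e-7,
    num_train_epochs=1,
)

trainer = DPOTrainer(
    model=model,
    args=config,
    train_dataset=dataset,
    processing_class=tokenizer,
)
trainer.train()
```

When no ref_model is supplied, DPOTrainer creates a frozen copy of the policy to serve as the reference, which matches the role of the SFT base in the objective above.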