Kyleyee/DPO_hh-seed5

Text generation · Concurrency cost: 1 · Model size: 1.5B · Quantization: BF16 · Context length: 32k · Published: Apr 23, 2026 · Architecture: Transformer

Kyleyee/DPO_hh-seed5 is a 1.5 billion parameter language model fine-tuned by Kyleyee, based on the Qwen2.5 architecture, with a context length of 32768 tokens. It was trained with Direct Preference Optimization (DPO) on a helpfulness preference dataset and is optimized for generating helpful, preference-aligned responses, building on the Qwen2.5-1.5B-sft-hh-3e base model.


Model Overview

Developed by Kyleyee, this 1.5 billion parameter model was fine-tuned from the Qwen2.5-1.5B-sft-hh-3e base model. It retains a substantial context length of 32768 tokens, making it suitable for processing longer inputs and generating comprehensive outputs.
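A minimal inference sketch follows, assuming the model is hosted on the Hugging Face Hub under the name above and loads with the standard `transformers` auto classes; the sampling settings are illustrative, not values recommended by the author.

```python
def generate_reply(prompt: str, max_new_tokens: int = 128) -> str:
    """Load Kyleyee/DPO_hh-seed5 and generate a reply to a single prompt.

    Imports are deferred inside the function so the sketch can be read
    (and the function defined) without transformers installed.
    """
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "Kyleyee/DPO_hh-seed5"  # assumed Hub id
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,  # the card lists BF16 weights
    )

    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        output_ids = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            temperature=0.7,  # illustrative sampling settings
        )
    # Strip the prompt tokens so only the newly generated text is returned.
    return tokenizer.decode(
        output_ids[0][inputs["input_ids"].shape[1]:],
        skip_special_tokens=True,
    )
```

With the full 32k context window, much longer prompts than a single question would also fit, at the cost of more memory during generation.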

Key Capabilities

  • Direct Preference Optimization (DPO): The model was trained with DPO, which aligns language models with human preferences by optimizing the policy directly on preference pairs; the reward model is defined implicitly by the policy itself, so no separate reward model is trained. This technique is detailed in the paper "Direct Preference Optimization: Your Language Model is Secretly a Reward Model" (arXiv:2305.18290).
  • Helpfulness Alignment: Fine-tuned on the Kyleyee/train_data_Helpful_drdpo_preference dataset, this model is specifically designed to generate helpful responses.
  • TRL Framework: The training was conducted with Hugging Face's TRL (Transformer Reinforcement Learning) library, a framework for post-training language models with methods such as SFT, DPO, and PPO.
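The per-pair DPO objective from the cited paper, −log σ(β[(log πθ(y_w|x) − log π_ref(y_w|x)) − (log πθ(y_l|x) − log π_ref(y_l|x))]), can be illustrated numerically. The sketch below evaluates it on invented toy log-probabilities; β = 0.1 is a common default, not necessarily the value used for this model.

```python
import math

def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Per-pair DPO loss: -log sigmoid(beta * (chosen log-ratio - rejected log-ratio))."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log sigmoid(margin), written via the numerically equivalent softplus(-margin)
    return math.log(1.0 + math.exp(-margin))

# Toy sequence log-probabilities (invented): relative to the reference, the
# policy has moved toward the chosen response and away from the rejected one,
# so the margin is positive and the loss falls below log(2) ≈ 0.693.
loss = dpo_loss(policy_chosen_logp=-1.0, policy_rejected_logp=-2.5,
                ref_chosen_logp=-1.5, ref_rejected_logp=-2.0)
print(round(loss, 4))  # → 0.6444
```

Minimizing this loss widens the implicit reward margin between chosen and rejected responses while β controls how far the policy may drift from the reference model.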

When to Use This Model

This model is particularly well-suited for applications requiring a smaller, efficient language model that can generate helpful and preference-aligned text. Its DPO training makes it a strong candidate for tasks where response quality and alignment with human preferences are critical, such as chatbots, content generation, or summarization where helpfulness is a key metric.
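For readers who want to reproduce this style of preference fine-tuning, here is a minimal TRL training sketch. It assumes a recent TRL release with `DPOTrainer`/`DPOConfig`, that the SFT base model is available on the Hub under the id shown, and that the named dataset exposes the prompt/chosen/rejected columns `DPOTrainer` expects; the hyperparameters are illustrative, not the ones used for this model.

```python
def build_dpo_trainer(output_dir: str = "dpo-hh-out"):
    """Sketch a DPO fine-tuning run with Hugging Face TRL.

    Imports are deferred inside the function so the sketch can be defined
    without the heavy dependencies installed.
    """
    from datasets import load_dataset
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from trl import DPOConfig, DPOTrainer

    base_model = "Kyleyee/Qwen2.5-1.5B-sft-hh-3e"  # assumed Hub id of the SFT base
    model = AutoModelForCausalLM.from_pretrained(base_model)
    tokenizer = AutoTokenizer.from_pretrained(base_model)

    # Dataset named on the card; assumed to follow TRL's preference schema.
    train_dataset = load_dataset(
        "Kyleyee/train_data_Helpful_drdpo_preference", split="train"
    )

    args = DPOConfig(
        output_dir=output_dir,
        beta=0.1,                        # illustrative KL-penalty strength
        learning_rate=5e-7,              # illustrative
        per_device_train_batch_size=2,   # illustrative
    )
    return DPOTrainer(
        model=model,
        args=args,
        train_dataset=train_dataset,
        processing_class=tokenizer,
    )
```

Calling `.train()` on the returned trainer runs the optimization; when no `ref_model` is passed, TRL creates a frozen copy of the policy model to serve as the reference.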