Kyleyee/DPO_hh-seed1

Text Generation · Model Size: 1.5B · Quantization: BF16 · Context Length: 32k · Published: Apr 23, 2026 · Architecture: Transformer

Kyleyee/DPO_hh-seed1 is a 1.5 billion parameter language model fine-tuned from Kyleyee/Qwen2.5-1.5B-sft-hh-3e. It was trained with Direct Preference Optimization (DPO) on the Kyleyee/train_data_Helpful_drdpo_preference dataset, aligning its outputs with human preferences so that it favors helpful responses. The model supports a context length of 32768 tokens, making it suitable for tasks that require extensive contextual understanding.
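A minimal inference sketch using the Hugging Face transformers pipeline. The prompt, the generation settings, and the assumption that the model inherits the Qwen2.5 chat template from its base checkpoint are illustrative, not confirmed by this model card:

```python
import torch
from transformers import pipeline

# Load the model for chat-style generation. bfloat16 matches the
# published BF16 weights; all generation settings here are illustrative.
generator = pipeline(
    "text-generation",
    model="Kyleyee/DPO_hh-seed1",
    torch_dtype=torch.bfloat16,
)

# Assumes the Qwen2.5 chat template survived fine-tuning (unverified).
messages = [{"role": "user", "content": "How do I write a polite follow-up email?"}]
output = generator(messages, max_new_tokens=256)

# For chat input, the pipeline returns the conversation with the new
# assistant turn appended at the end.
print(output[0]["generated_text"][-1]["content"])
```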


Model Overview

Kyleyee/DPO_hh-seed1 is a 1.5 billion parameter language model, building upon the Kyleyee/Qwen2.5-1.5B-sft-hh-3e base model. Its primary distinction lies in its training methodology: it has been fine-tuned using Direct Preference Optimization (DPO). This technique, detailed in the paper "Direct Preference Optimization: Your Language Model is Secretly a Reward Model," aims to align the model's outputs more closely with human preferences by directly optimizing a policy against a reference model.
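For concreteness, here is a minimal sketch of the DPO objective described above, written as a standalone PyTorch function. The tensor names and the beta value are illustrative defaults, not values taken from this model's training run:

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss from per-sequence log-probabilities.

    Each argument is a tensor of summed token log-probs for the chosen
    (preferred) or rejected (dispreferred) completion, under either the
    policy being trained or the frozen reference model. beta=0.1 is a
    common default, not a published hyperparameter for this model.
    """
    policy_logratio = policy_chosen_logps - policy_rejected_logps
    ref_logratio = ref_chosen_logps - ref_rejected_logps
    # Push the policy to widen the chosen-vs-rejected margin relative
    # to the reference model.
    return -F.logsigmoid(beta * (policy_logratio - ref_logratio)).mean()
```

Widening this log-probability margin relative to the frozen reference model is what "directly optimizing a policy against a reference model" means in practice.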

Key Capabilities

  • Preference Alignment: Optimized to generate responses that are preferred by humans, based on the training data.
  • Contextual Understanding: Supports a substantial context length of 32768 tokens, allowing it to process and generate text based on extensive input.
  • Instruction Following: As a fine-tuned model, it is capable of following instructions to generate relevant and coherent text.

Training Details

This model was trained on the Kyleyee/train_data_Helpful_drdpo_preference dataset using the TRL library. The DPO method directly optimizes the policy to maximize the likelihood of preferred responses over dispreferred ones, without requiring a separate reward model.
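A hedged sketch of what the training setup may have looked like with TRL's DPOTrainer. The hyperparameters, the output directory, and the exact keyword names (which vary across TRL releases) are assumptions; only the model and dataset identifiers come from this card:

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

# Start from the SFT checkpoint named in the model card.
model = AutoModelForCausalLM.from_pretrained("Kyleyee/Qwen2.5-1.5B-sft-hh-3e")
tokenizer = AutoTokenizer.from_pretrained("Kyleyee/Qwen2.5-1.5B-sft-hh-3e")

# Preference dataset named in the model card (chosen/rejected pairs).
dataset = load_dataset("Kyleyee/train_data_Helpful_drdpo_preference", split="train")

# beta and output_dir are illustrative defaults, not published values.
config = DPOConfig(output_dir="DPO_hh-seed1", beta=0.1)

trainer = DPOTrainer(
    model=model,
    args=config,
    train_dataset=dataset,
    processing_class=tokenizer,  # `tokenizer=` in older TRL releases
)
trainer.train()
```

When no explicit reference model is passed, DPOTrainer clones the initial policy and freezes it as the reference, matching the "without requiring a separate reward model" property described above.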

Good For

  • Applications requiring models that generate helpful and human-aligned responses.
  • Tasks where preference-based fine-tuning is crucial for output quality.
  • Developers who want a small, DPO-tuned model with a large context window that can be deployed efficiently.