Kyleyee/HINGE_hh-seed4

Text Generation · Concurrency Cost: 1 · Model Size: 1.5B · Quant: BF16 · Ctx Length: 32k · Published: Apr 28, 2026 · Architecture: Transformer

Kyleyee/HINGE_hh-seed4 is a 1.5 billion parameter instruction-tuned causal language model, fine-tuned by Kyleyee using Direct Preference Optimization (DPO). It builds on Kyleyee/Qwen2.5-1.5B-sft-hh-3e and was trained on a helpfulness preference dataset, making it well suited to generating helpful, preference-aligned responses. With a context length of 32,768 tokens, it targets conversational AI and instruction-following tasks.


Model Overview

Kyleyee/HINGE_hh-seed4 is a 1.5 billion parameter language model developed by Kyleyee, building on the Kyleyee/Qwen2.5-1.5B-sft-hh-3e base model. It was fine-tuned with Direct Preference Optimization (DPO), a method that aligns language models with human preferences by using the policy itself as an implicit reward model, so no separate reward model has to be trained. Training used the Kyleyee/train_data_Helpful_drdpo_preference dataset, with a focus on making responses more helpful.
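A minimal inference sketch with Hugging Face transformers is shown below. It assumes the tokenizer ships a chat template inherited from Qwen2.5; the prompt and sampling settings are illustrative, not recommendations from the model card.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Kyleyee/HINGE_hh-seed4"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # matches the BF16 precision listed above
    device_map="auto",
)

# Assumes a chat template is present (inherited from the Qwen2.5 base).
messages = [{"role": "user", "content": "Explain DPO in two sentences."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
# Decode only the newly generated tokens, not the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```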

Key Capabilities

  • Preference-aligned responses: Trained with DPO to generate outputs that are more helpful and aligned with human preferences.
  • Instruction following: Optimized for tasks requiring the model to adhere to specific instructions.
  • Conversational AI: Suitable for dialogue systems and interactive applications due to its fine-tuning on a helpfulness dataset.

Training Details

The model was trained using the TRL (Transformer Reinforcement Learning) library. The DPO method, as introduced in the paper "Direct Preference Optimization: Your Language Model is Secretly a Reward Model," was central to its fine-tuning process. This approach leverages preference data to implicitly learn a reward model, guiding the language model towards desired behaviors without explicit reward modeling.
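For orientation, the sketch below shows what such a DPO fine-tuning run looks like with TRL's DPOTrainer. The hyperparameters (beta, learning rate, batch size) are illustrative placeholders rather than the settings actually used for this model, and the dataset is assumed to follow TRL's prompt/chosen/rejected preference format.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base = "Kyleyee/Qwen2.5-1.5B-sft-hh-3e"
model = AutoModelForCausalLM.from_pretrained(base)
tokenizer = AutoTokenizer.from_pretrained(base)

# Preference data; assumed to use TRL's prompt/chosen/rejected columns.
dataset = load_dataset("Kyleyee/train_data_Helpful_drdpo_preference", split="train")

# Illustrative hyperparameters -- not the author's actual settings.
config = DPOConfig(
    output_dir="dpo-hh",
    beta=0.1,  # strength of the implicit KL penalty toward the SFT reference
    per_device_train_batch_size=2,
    learning_rate=5e-7,
)

trainer = DPOTrainer(
    model=model,  # a frozen copy serves as the reference model by default
    args=config,
    train_dataset=dataset,
    processing_class=tokenizer,  # named `tokenizer=` in older TRL releases
)
trainer.train()
```

Because DPO optimizes the policy directly on preference pairs, this loop needs no reward-model training stage and no RL sampling loop, which is what makes it practical at the 1.5B scale.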

Good For

  • Applications requiring helpful and aligned text generation.
  • Instruction-based conversational agents.
  • Research into DPO and preference-based fine-tuning on smaller models.