Kyleyee/HINGE_hh-seed4
Kyleyee/HINGE_hh-seed4 is a 1.5 billion parameter instruction-tuned causal language model, fine-tuned by Kyleyee using Direct Preference Optimization (DPO). This model is based on Kyleyee/Qwen2.5-1.5B-sft-hh-3e and trained on a helpful preference dataset, making it suitable for generating helpful and aligned responses. With a context length of 32768 tokens, it is optimized for conversational AI and instruction-following tasks.
Model Overview
Kyleyee/HINGE_hh-seed4 is a 1.5 billion parameter language model developed by Kyleyee, building upon the Kyleyee/Qwen2.5-1.5B-sft-hh-3e base model. It has been fine-tuned using Direct Preference Optimization (DPO), a method that aligns language models with human preferences by optimizing directly on pairs of preferred and rejected responses, with the policy itself acting as an implicit reward model. The training utilized the Kyleyee/train_data_Helpful_drdpo_preference dataset, focusing on enhancing the helpfulness of its responses.
Key Capabilities
- Preference-aligned responses: Trained with DPO to generate outputs that are more helpful and aligned with human preferences.
- Instruction following: Optimized for tasks requiring the model to adhere to specific instructions.
- Conversational AI: Suitable for dialogue systems and interactive applications due to its fine-tuning on a helpfulness dataset.
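For conversational use, prompts for Qwen2.5-family models follow the ChatML format. In practice you would load the tokenizer and call `tokenizer.apply_chat_template`, which handles this automatically; the hand-rolled sketch below (a hypothetical helper, not part of the model's code) only illustrates the structure of such a prompt:

```python
def build_chatml_prompt(messages):
    """Render a list of {role, content} messages into a ChatML prompt string."""
    parts = []
    for m in messages:
        parts.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n")
    # Open an assistant turn so the model generates the reply.
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)

prompt = build_chatml_prompt([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize DPO in one sentence."},
])
```

The resulting string is what the tokenizer's chat template produces for generation; feeding it to the model with `model.generate` yields the assistant's response.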
Training Details
The model was trained using the TRL (Transformer Reinforcement Learning) library. The DPO method, introduced in the paper "Direct Preference Optimization: Your Language Model is Secretly a Reward Model," was central to its fine-tuning. Rather than training a separate reward model and then running reinforcement learning, DPO optimizes the policy directly on preference pairs: the log-probability ratio between the policy and a frozen reference model serves as an implicit reward, and the loss pushes that reward higher for preferred responses than for rejected ones.
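The per-pair objective described above can be sketched in a few lines. This is an illustrative reimplementation of the DPO loss, -log σ(β[(log π(y_c) - log π_ref(y_c)) - (log π(y_r) - log π_ref(y_r))]), not the actual TRL training code (which is handled by TRL's `DPOTrainer`):

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair.

    Each argument is the summed log-probability of the chosen or rejected
    response under the policy or the frozen reference model. beta scales
    the implicit reward (0.1 is a common default in TRL).
    """
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log sigmoid(margin): small when the policy clearly prefers the
    # chosen response relative to the reference, large otherwise.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy matches the reference exactly, the margin is zero and the loss is log 2; as the policy raises the chosen response's likelihood relative to the rejected one's, the loss falls toward zero.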
Good For
- Applications requiring helpful and aligned text generation.
- Instruction-based conversational agents.
- Research into DPO and preference-based fine-tuning on smaller models.