Kyleyee/DPO_hh-seed4
Kyleyee/DPO_hh-seed4 is a 1.5-billion-parameter causal language model fine-tuned by Kyleyee using Direct Preference Optimization (DPO). Based on Kyleyee/Qwen2.5-1.5B-sft-hh-3e and trained on the Kyleyee/train_data_Helpful_drdpo_preference dataset, it specializes in generating helpful responses. It is suited to conversational AI tasks that require helpful, preference-aligned outputs within its 32,768-token context window.
Model Overview
Kyleyee/DPO_hh-seed4 is a 1.5-billion-parameter language model developed by Kyleyee, fine-tuned from the Kyleyee/Qwen2.5-1.5B-sft-hh-3e base model. Its primary distinction is its training methodology: Direct Preference Optimization (DPO), a technique that aligns language models with human preferences without training a separate reward model.
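Concretely, DPO replaces an explicit reward model with a classification-style loss over preference pairs. The objective from the DPO paper is:

```math
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) =
-\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
\left[ \log \sigma\!\left(
\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
- \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
\right) \right]
```

Here x is a prompt, y_w and y_l are the chosen and rejected responses, σ is the logistic function, π_ref is the frozen SFT reference model, and β controls how far the policy π_θ may drift from the reference. The reward signal is implicit in the log-probability ratios, which is why no separately trained reward model is needed.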
Key Capabilities
- Preference-aligned generation: Trained on the Kyleyee/train_data_Helpful_drdpo_preference dataset, the model is optimized to produce helpful, human-preferred responses.
- Conversational AI: Generates coherent, contextually relevant text for interactive applications, as shown in the quick-start example below.
- Efficient fine-tuning: Uses the TRL library for DPO training, an effective and scalable alignment workflow.
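A minimal quick-start sketch using the Transformers chat pipeline, assuming a recent Transformers version with chat-template support; the prompt and generation parameters are illustrative, not values published for this model:

```python
from transformers import pipeline

# Load the model as a chat-style text-generation pipeline.
generator = pipeline(
    "text-generation",
    model="Kyleyee/DPO_hh-seed4",
    device_map="auto",  # place the model on a GPU if one is available
)

# Chat-formatted input: a list of {role, content} messages.
messages = [
    {"role": "user", "content": "What are some tips for staying focused while studying?"},
]

# Generate a response (sampling settings here are illustrative).
output = generator(messages, max_new_tokens=256, do_sample=True, temperature=0.7)

# The pipeline returns the full conversation; the last message is the reply.
print(output[0]["generated_text"][-1]["content"])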
Training Details
The model was fine-tuned using the DPO method, as detailed in the paper "Direct Preference Optimization: Your Language Model is Secretly a Reward Model." This approach directly optimizes a policy to maximize the likelihood of preferred responses over dispreferred ones, making it effective for aligning model behavior with human preferences. The training used the TRL, Transformers, PyTorch, Datasets, and Tokenizers libraries.
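A sketch of how such a run might be reproduced with TRL's DPOTrainer, assuming a recent TRL version and that the dataset exposes the prompt/chosen/rejected columns DPOTrainer expects; the hyperparameters below are assumptions, not Kyleyee's actual training recipe:

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

# Start from the SFT checkpoint that DPO_hh-seed4 was fine-tuned from.
model = AutoModelForCausalLM.from_pretrained("Kyleyee/Qwen2.5-1.5B-sft-hh-3e")
tokenizer = AutoTokenizer.from_pretrained("Kyleyee/Qwen2.5-1.5B-sft-hh-3e")

# Preference pairs: each row holds a prompt plus chosen/rejected responses.
dataset = load_dataset("Kyleyee/train_data_Helpful_drdpo_preference", split="train")

# Hyperparameters below are illustrative placeholders.
config = DPOConfig(
    output_dir="DPO_hh-seed4",
    beta=0.1,                        # strength of the KL penalty toward the reference model
    learning_rate=5e-7,
    per_device_train_batch_size=4,
)

# DPOTrainer builds the frozen reference model internally when none is passed.
trainer = DPOTrainer(
    model=model,
    args=config,
    train_dataset=dataset,
    processing_class=tokenizer,
)
trainer.train()
```

Because DPO needs only these offline preference pairs and a frozen copy of the starting policy, the whole alignment step runs as a single supervised-style training loop.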