NLP-Final-Project/phi-2-ipo
NLP-Final-Project/phi-2-ipo is a 2.7 billion parameter language model fine-tuned from Microsoft's phi-2 base model. It was trained with Direct Preference Optimization (DPO) via the TRL library to better align its outputs with human preferences. The model is designed for text generation tasks, offering improved response quality through preference-based learning.
Model Overview
NLP-Final-Project/phi-2-ipo is a 2.7 billion parameter language model fine-tuned from the original microsoft/phi-2 base model. The fine-tuning used Direct Preference Optimization (DPO), the method introduced in the paper "Direct Preference Optimization: Your Language Model is Secretly a Reward Model" (https://arxiv.org/abs/2305.18290). Training was conducted with the TRL library, version 1.3.0.
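The model can be loaded and queried like any other causal language model in the Transformers ecosystem. Below is a minimal inference sketch; it assumes the repository exposes standard Transformers weights, and the prompt and generation settings are illustrative rather than recommended values.

```python
# Minimal inference sketch; the prompt and sampling settings are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "NLP-Final-Project/phi-2-ipo"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # half precision to fit consumer GPUs
    device_map="auto",          # requires the accelerate package
)

prompt = "Explain direct preference optimization in one paragraph."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```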
Key Capabilities
- Preference-Aligned Text Generation: Generates responses that align more closely with human preferences as a result of DPO training.
- Efficient Fine-tuning: Demonstrates that DPO can be applied effectively to fine-tune smaller language models.
Training Details
The model's training procedure leveraged DPO, which optimizes a language model directly on preference data, without fitting a separate reward model, to improve the quality and helpfulness of generated text. The training environment included Transformers 5.8.0, PyTorch 2.11.0, Datasets 4.8.5, and Tokenizers 0.22.2.
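For reference, DPO fine-tunes the policy $\pi_\theta$ against a frozen reference model $\pi_{\mathrm{ref}}$ using preference pairs $(x, y_w, y_l)$, where $y_w$ is the preferred completion, by minimizing the objective from the paper cited above:

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]
$$

The exact training script, dataset, and hyperparameters for this model are not documented here. The following is a minimal sketch of a DPO run with TRL's DPOTrainer; the dataset, output path, and hyperparameter values are illustrative assumptions, and argument names vary across TRL versions (older releases use tokenizer= where recent ones use processing_class=).

```python
# Minimal DPO fine-tuning sketch with TRL's DPOTrainer.
# The dataset, output path, and hyperparameters below are illustrative
# assumptions, not the actual configuration used for phi-2-ipo.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "microsoft/phi-2"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # phi-2 ships without a pad token

# Any preference dataset with prompt/chosen/rejected pairs works here.
train_dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

training_args = DPOConfig(
    output_dir="phi-2-dpo",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=5e-7,
    beta=0.1,  # strength of the implicit KL constraint toward the reference model
)

trainer = DPOTrainer(
    model=model,                 # policy to optimize
    args=training_args,
    train_dataset=train_dataset,
    processing_class=tokenizer,
)
trainer.train()
```

When no ref_model is passed, DPOTrainer creates a frozen copy of the policy to serve as $\pi_{\mathrm{ref}}$, matching the setup in the objective above.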