wxzhang/selective-pairrm-33045197-mt0
wxzhang/selective-pairrm-33045197-mt0 is a 7-billion-parameter instruction-tuned language model, fine-tuned by wxzhang from mistralai/Mistral-7B-Instruct-v0.2. It was trained with Direct Preference Optimization (DPO) on the snorkelai/Snorkel-Mistral-PairRM-DPO-Dataset and reaches a rewards accuracy of 0.6055 on the evaluation set. The model is intended to generate responses aligned with human preferences, making it suitable for tasks that require nuanced, preference-aware output.
Model Overview
wxzhang/selective-pairrm-33045197-mt0 is a 7-billion-parameter language model derived from mistralai/Mistral-7B-Instruct-v0.2. It has been fine-tuned using Direct Preference Optimization (DPO) on the snorkelai/Snorkel-Mistral-PairRM-DPO-Dataset.
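A minimal usage sketch with the transformers library, assuming the weights are published on the Hugging Face Hub under this model id and follow the standard Mistral-Instruct chat template:

```python
# Minimal sketch: load the model and generate a preference-aligned reply.
# Assumes the repo exists on the Hub and ships a standard chat template.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "wxzhang/selective-pairrm-33045197-mt0"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Mistral-Instruct models expect the tokenizer's chat template.
messages = [{"role": "user", "content": "Explain DPO in two sentences."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256, do_sample=False)
# Decode only the newly generated tokens, not the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```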
Key Characteristics
- Base Model: Mistral-7B-Instruct-v0.2
- Fine-tuning Method: Direct Preference Optimization (DPO)
- Training Dataset: snorkelai/Snorkel-Mistral-PairRM-DPO-Dataset
- Evaluation Performance: Achieved a rewards accuracy of 0.6055 on the evaluation set, with a final loss of 0.6825.
Training Details
The model was trained for 1 epoch with a learning rate of 5e-07, a total training batch size of 64, and the Adam optimizer. Training ran on 4 devices with gradient accumulation over 4 steps, which implies a per-device batch size of 4 (64 / 4 devices / 4 accumulation steps).
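For illustration, a DPO run with these hyperparameters might look roughly like the TRL sketch below. The per-device batch size of 4 is inferred from the totals above rather than stated on this card, and the exact DPOTrainer/DPOConfig signatures vary across TRL versions:

```python
# Hedged sketch of the reported setup using Hugging Face TRL.
# Assumptions: per-device batch size 4 (inferred) and default DPO beta.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base = "mistralai/Mistral-7B-Instruct-v0.2"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

dataset = load_dataset(
    "snorkelai/Snorkel-Mistral-PairRM-DPO-Dataset", split="train"
)

args = DPOConfig(
    output_dir="selective-pairrm-dpo",
    num_train_epochs=1,             # reported: 1 epoch
    learning_rate=5e-7,             # reported: 5e-07
    per_device_train_batch_size=4,  # inferred from the totals above
    gradient_accumulation_steps=4,  # reported: 4 steps
)

# Older TRL versions take tokenizer= instead of processing_class=.
trainer = DPOTrainer(
    model=model, args=args, train_dataset=dataset, processing_class=tokenizer
)
trainer.train()
```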
Potential Use Cases
This model is particularly well suited to applications where generating responses aligned with specific preferences or human feedback is crucial. Its DPO fine-tuning suggests an ability to distinguish preferred from rejected outputs, making it valuable for tasks such as:
- Preference-aligned text generation
- Response ranking and selection (see the scoring sketch after this list)
- Dialogue systems requiring nuanced output
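As a concrete, hypothetical illustration of response ranking, one simple approach is to score each candidate reply by its average token log-likelihood under the model and pick the highest-scoring one. The helper below is not part of this model's tooling, and it reuses the `model` and `tokenizer` loaded in the earlier snippet:

```python
# Hypothetical ranking sketch: score candidates by their average
# log-probability under the model, then select the best one.
import torch
import torch.nn.functional as F

def score_response(prompt: str, response: str) -> float:
    """Average log-probability of `response` tokens given `prompt`."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
    full_ids = tokenizer(prompt + response, return_tensors="pt").input_ids.to(model.device)
    with torch.no_grad():
        logits = model(full_ids).logits
    # Log-prob of each token given its prefix (predictions are shifted by one).
    log_probs = F.log_softmax(logits[:, :-1], dim=-1)
    targets = full_ids[:, 1:]
    token_lp = log_probs.gather(2, targets.unsqueeze(-1)).squeeze(-1)
    # Keep only the response portion (approximate: tokenization of the
    # concatenated string may shift the boundary by a token).
    resp_lp = token_lp[:, prompt_ids.shape[-1] - 1:]
    return resp_lp.mean().item()

candidates = ["Paris is the capital of France.", "I think it might be Lyon."]
best = max(candidates, key=lambda r: score_response("What is the capital of France? ", r))
print(best)
```

Length-normalizing by the token count keeps the score from systematically favoring shorter candidates; whether raw or normalized likelihood works better is an empirical choice.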