Model Overview
mlfoundations-dev/simpo-evol_tt_5s is an 8-billion-parameter language model fine-tuned from the mlfoundations-dev/evol_tt_5s base model. Fine-tuning used the mlfoundations-dev/gemma2-ultrafeedback-armorm preference dataset, with the goal of improving the model's alignment with human preferences, i.e., its ability to distinguish chosen from rejected responses.
Key Performance Metrics
During evaluation, the model achieved notable results:
- Rewards/accuracies: 0.8001, indicating a strong ability to correctly identify preferred responses.
- Rewards/margins: 7.3488, reflecting a clear distinction between chosen and rejected outputs.
- Loss: 2.4559 on the evaluation set.
Training Details
The model was trained for 1 epoch with a learning rate of 8e-07 and an effective batch size of 128 spread across 8 GPUs, using the AdamW optimizer and a cosine learning-rate scheduler with a warmup ratio of 0.1. This configuration targets stable optimization for preference-alignment training.
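The "simpo" in the model name suggests training with the SimPO objective, which scores responses by their length-normalized log-likelihood and needs no reference model (unlike DPO). A minimal sketch of that loss for one preference pair is below; the `beta` and `gamma` values are illustrative assumptions, not the hyperparameters of this training run.

```python
import math

def simpo_loss(logp_chosen: float, len_chosen: int,
               logp_rejected: float, len_rejected: int,
               beta: float = 2.0, gamma: float = 0.5) -> float:
    """SimPO loss for a single (chosen, rejected) pair (sketch).

    The implicit reward is the length-normalized sequence
    log-likelihood scaled by beta; gamma is a target reward margin.
    beta and gamma here are illustrative, not this model's values.
    """
    reward_chosen = beta * logp_chosen / len_chosen
    reward_rejected = beta * logp_rejected / len_rejected
    margin = reward_chosen - reward_rejected
    # -log sigmoid(margin - gamma): loss approaches 0 as the
    # chosen response's reward exceeds the rejected one's by gamma
    return -math.log(1.0 / (1.0 + math.exp(-(margin - gamma))))

# A pair where the chosen response is far more likely per token:
# the margin is large, so the loss is close to zero.
loss = simpo_loss(logp_chosen=-20.0, len_chosen=10,
                  logp_rejected=-60.0, len_rejected=12)
print(round(loss, 4))
```

The reported rewards/margins metric corresponds to the `margin` quantity above averaged over the evaluation set: larger values mean the model separates chosen and rejected responses more cleanly.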
Intended Use Cases
While specific intended uses are not detailed in the original README, fine-tuning on a preference-feedback dataset suggests suitability for applications that require:
- Response quality assessment: Distinguishing between high-quality and lower-quality generated text.
- Preference alignment: Generating outputs that are more likely to be preferred by users.
- Reinforcement Learning from Human Feedback (RLHF) related tasks: Serving as a component in systems that learn from human preferences.
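The response-quality-assessment use case above can be sketched with SimPO's implicit reward: rank candidate responses by their length-normalized log-likelihood under the fine-tuned model. The candidate texts and log-probability scores below are hypothetical; in a real pipeline the scores would come from the model's output logits.

```python
def implicit_reward(logp: float, length: int, beta: float = 2.0) -> float:
    """Length-normalized log-likelihood, SimPO's implicit reward (sketch).

    beta is an illustrative scaling factor, not this model's value.
    """
    return beta * logp / length

def pick_preferred(candidates: list[tuple[str, float, int]]) -> str:
    """Return the candidate text with the highest implicit reward.

    Each candidate is (text, total_logp, num_tokens); per-token
    normalization keeps longer responses from being penalized
    simply for accumulating more log-probability mass.
    """
    return max(candidates, key=lambda c: implicit_reward(c[1], c[2]))[0]

# Hypothetical scores for two candidate answers to the same prompt
candidates = [
    ("Terse but accurate answer.", -12.0, 6),    # reward: -4.0
    ("Rambling answer with filler.", -80.0, 20), # reward: -8.0
]
print(pick_preferred(candidates))
```

Here the first candidate wins because its per-token likelihood is higher; a well-trained preference model should assign such margins in line with human judgments.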