Model Overview
mlfoundations-dev/simpo-oh_teknium_scaling_down_ratiocontrolled_0.9 is an 8-billion-parameter language model fine-tuned from the base model mlfoundations-dev/oh_teknium_scaling_down_ratiocontrolled_0.9. It was trained on the mlfoundations-dev/gemma2-ultrafeedback-armorm dataset, indicating a focus on preference learning and alignment.
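The card does not show the dataset schema, but ultrafeedback/ArmoRM-style preference datasets typically pair each prompt with a chosen and a rejected response. A minimal sketch, assuming that common `prompt`/`chosen`/`rejected` layout (the field names are an assumption, not confirmed by this card):

```python
# Illustrative record in the common preference-pair layout; the field names
# ("prompt", "chosen", "rejected") are assumed, not taken from the dataset card.
example_record = {
    "prompt": "Explain why the sky is blue.",
    "chosen": "Sunlight scatters off air molecules, and shorter blue wavelengths scatter most.",
    "rejected": "The sky is blue because it reflects the color of the ocean.",
}

def is_valid_preference_record(record: dict) -> bool:
    """Check that a record carries the three string fields a preference trainer expects."""
    return all(isinstance(record.get(key), str) for key in ("prompt", "chosen", "rejected"))
```

A preference trainer consumes such pairs directly; no scalar reward labels are required, since the chosen/rejected ordering itself is the supervision signal.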
Key Performance Metrics
After a single epoch of training, the model achieved the following results on the evaluation set:
- Loss: 2.9107
- Rewards/accuracies: 0.7604
- Rewards/margins: 5.3780
The rewards/accuracies and rewards/margins metrics reflect how reliably the model distinguishes preferred from rejected responses, as is typical for models fine-tuned with preference-optimization methods such as SimPO (suggested by the model's name) or reinforcement learning from human feedback (RLHF).
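As an illustrative sketch (not this model's actual evaluation code), rewards/accuracies and rewards/margins are typically derived from per-example reward scores assigned to the chosen and rejected responses of each preference pair:

```python
def preference_metrics(chosen_rewards, rejected_rewards):
    """Compute the two metrics as preference trainers commonly report them:
    - accuracy: fraction of pairs where the chosen response outscores the rejected one
    - margin:   mean gap (chosen reward - rejected reward) across pairs
    This is an illustrative sketch of the convention, not the project's code."""
    pairs = list(zip(chosen_rewards, rejected_rewards))
    accuracy = sum(c > r for c, r in pairs) / len(pairs)
    margin = sum(c - r for c, r in pairs) / len(pairs)
    return accuracy, margin

# Example: 3 of 4 pairs rank correctly -> accuracy 0.75, mean margin 1.0
acc, margin = preference_metrics([2.0, 1.0, 3.0, 0.0], [1.0, 2.0, 0.0, -1.0])
```

By this convention, the reported 0.7604 accuracy means the model ranked the preferred response above the rejected one in roughly 76% of evaluation pairs, with an average reward gap of about 5.38.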
Training Details
The model was trained with a learning rate of 8e-07 and a total batch size of 128 across 8 GPUs. Training used a cosine learning-rate scheduler with a warmup ratio of 0.1 over a single epoch. The model's context length is 32768 tokens.
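The schedule described above can be sketched as follows; the actual run would have used the training framework's built-in scheduler, so this helper is purely illustrative, with `base_lr` and `warmup_ratio` matching the values reported here:

```python
import math

def cosine_lr_with_warmup(step, total_steps, base_lr=8e-7, warmup_ratio=0.1):
    """Linear warmup over the first warmup_ratio of steps, then cosine decay to 0.
    Illustrative helper, not the scheduler actually used in training."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Ramp linearly from ~0 up to base_lr during warmup.
        return base_lr * (step + 1) / max(1, warmup_steps)
    # Cosine decay from base_lr toward 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```

With a warmup ratio of 0.1, the learning rate peaks at 8e-07 one tenth of the way through the epoch and then decays smoothly to near zero by the final step.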
Intended Use Cases
Given its fine-tuning on a preference-feedback dataset, this model is likely suitable for applications requiring:
- Preference-aligned text generation: Generating responses that align with human preferences.
- Dialogue systems: Improving the quality and helpfulness of conversational AI.
- Content moderation: Identifying and filtering undesirable content based on learned preferences.