mlfoundations-dev/simpo-oh-dcft-v3.1-llama-3.1-405b
mlfoundations-dev/simpo-oh-dcft-v3.1-llama-3.1-405b is an 8-billion-parameter language model fine-tuned by mlfoundations-dev, based on the Llama 3.1 architecture with a 32,768-token context length. It is a fine-tuned version of oh-dcft-v3.1-llama-3.1-405b, optimized on the mlfoundations-dev/gemma2-ultrafeedback-armorm dataset. The model shows improved reward metrics and accuracy on its evaluation set, suggesting a focus on alignment or preference-learning tasks.
Model Overview
The mlfoundations-dev/simpo-oh-dcft-v3.1-llama-3.1-405b is an 8 billion parameter language model derived from the Llama 3.1 architecture, featuring a 32768 token context window. It is a fine-tuned iteration of the mlfoundations-dev/oh-dcft-v3.1-llama-3.1-405b base model.
Key Characteristics
- Fine-tuning Objective: The model was fine-tuned on the mlfoundations-dev/gemma2-ultrafeedback-armorm dataset, indicating a focus on learning from human preferences or feedback.
- Performance Metrics: During evaluation, the model achieved a loss of 2.5125, a reward accuracy of 0.8018, and a reward margin of 7.8232. These metrics suggest improved alignment with the preferred outputs in the training data.
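The "simpo" prefix in the model name suggests the SimPO preference-optimization objective, which scores each response with a length-normalized implicit reward and penalizes the model when the chosen response does not beat the rejected one by a target margin. A minimal sketch of that objective in plain Python is below; the `beta` and `gamma` values are illustrative assumptions, as the card does not state the hyperparameters used:

```python
import math

def simpo_reward(logp, length, beta=10.0):
    # SimPO's length-normalized implicit reward: r = (beta / |y|) * log pi(y | x).
    # beta is an illustrative value, not taken from the model card.
    return beta * logp / length

def simpo_loss(logp_chosen, len_chosen, logp_rejected, len_rejected,
               beta=10.0, gamma=5.4):
    # Pairwise loss: -log sigmoid(r_chosen - r_rejected - gamma),
    # written as log1p(exp(-x)) for numerical stability.
    # gamma (the target reward margin) is likewise an assumed value.
    r_w = simpo_reward(logp_chosen, len_chosen, beta)
    r_l = simpo_reward(logp_rejected, len_rejected, beta)
    margin = r_w - r_l
    loss = math.log1p(math.exp(-(margin - gamma)))
    return loss, margin
```

When the chosen response's normalized log-probability clearly exceeds the rejected one's, the margin is large and the loss approaches zero; the "reward margin" metric reported above is the average of exactly this kind of pairwise difference.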
Training Details
The training process involved specific hyperparameters:
- Learning Rate: 8e-07
- Batch Sizes: train_batch_size of 2 and eval_batch_size of 2 per device, with a total_train_batch_size of 128 and a total_eval_batch_size of 16 (from multi-device training and gradient accumulation).
- Optimizer: AdamW with default betas and epsilon.
- Epochs: Trained for 1.0 epoch.
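The effective (total) batch size is the product of the per-device batch size, the number of devices, and the gradient-accumulation steps. The card gives only the per-device and total figures, so the device count and accumulation steps below are one consistent assumption (8 devices, 8 accumulation steps), not values stated in the card:

```python
def effective_batch_size(per_device, num_devices, grad_accum_steps):
    # Total examples contributing to one optimizer step.
    return per_device * num_devices * grad_accum_steps

# per_device=2 with a total of 128 is consistent with, e.g.,
# 8 devices and 8 gradient-accumulation steps (an assumption):
train_total = effective_batch_size(2, 8, 8)   # 128

# Evaluation does not accumulate gradients, so total_eval_batch_size=16
# follows from 2 per device on 8 devices:
eval_total = effective_batch_size(2, 8, 1)    # 16
```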
Potential Use Cases
Given its fine-tuning on a feedback-oriented dataset, this model is likely suitable for applications requiring:
- Response Generation: Generating outputs that align with human preferences or specific quality criteria.
- Preference Learning Tasks: Scenarios where ranking or choosing between different responses is critical.
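The reward accuracy and reward margin reported in the metrics above can be computed from per-pair implicit rewards on the evaluation set. This is a sketch of that aggregation; the card does not describe the exact evaluation harness:

```python
def reward_accuracy(pairs):
    # pairs: list of (reward_chosen, reward_rejected) tuples.
    # Fraction of pairs where the chosen response gets the higher
    # implicit reward (the card reports 0.8018 on its eval set).
    correct = sum(1 for r_w, r_l in pairs if r_w > r_l)
    return correct / len(pairs)

def mean_reward_margin(pairs):
    # Average of reward_chosen - reward_rejected
    # (the card reports 7.8232 on its eval set).
    return sum(r_w - r_l for r_w, r_l in pairs) / len(pairs)
```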
Limitations
The model card indicates that more information is needed regarding its intended uses, limitations, and the specifics of its training and evaluation data. Users should exercise caution and conduct further testing for specific applications.