mlfoundations-dev/simpo-oh-dcft-v3.1-llama-3.1-405b

Source: Hugging Face

Text Generation | Concurrency Cost: 1 | Model Size: 8B | Quant: FP8 | Ctx Length: 32k | License: llama3.1 | Architecture: Transformer

The mlfoundations-dev/simpo-oh-dcft-v3.1-llama-3.1-405b is an 8 billion parameter language model released by mlfoundations-dev, built on the Llama 3.1 architecture with a 32,768-token context length. It is a fine-tuned version of oh-dcft-v3.1-llama-3.1-405b, optimized on the mlfoundations-dev/gemma2-ultrafeedback-armorm preference dataset. Its improved reward accuracy and reward margin on the evaluation set indicate a focus on alignment and preference-learning tasks.


Model Overview

The mlfoundations-dev/simpo-oh-dcft-v3.1-llama-3.1-405b is an 8 billion parameter language model derived from the Llama 3.1 architecture, featuring a 32,768-token context window. It is a fine-tuned iteration of the mlfoundations-dev/oh-dcft-v3.1-llama-3.1-405b base model.

Key Characteristics

  • Fine-tuning Objective: The model was fine-tuned on the mlfoundations-dev/gemma2-ultrafeedback-armorm dataset, indicating a focus on learning from human preferences or feedback.
  • Performance Metrics: During evaluation, the model achieved a loss of 2.5125, a reward accuracy of 0.8018, and a reward margin of 7.8232. These metrics suggest improved alignment with the preferences in the training data (see the sketch after this list for how such metrics are typically computed).
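
The reward accuracy and reward margin come from comparing the policy's implicit reward on chosen versus rejected responses; under the SimPO objective this reward is the length-normalized log-likelihood scaled by a factor β. Below is a minimal sketch of that computation; the function name, the β value, and the tensor shapes are illustrative assumptions, not details taken from the model card:

```python
import torch

def simpo_rewards(chosen_logps, rejected_logps,
                  chosen_lengths, rejected_lengths, beta=2.0):
    """Length-normalized implicit rewards in the style of SimPO.

    chosen_logps / rejected_logps: summed token log-probs of each
    response under the policy (shape: [batch]).
    chosen_lengths / rejected_lengths: response lengths in tokens.
    """
    r_chosen = beta * chosen_logps / chosen_lengths
    r_rejected = beta * rejected_logps / rejected_lengths
    margins = r_chosen - r_rejected            # "reward margin"
    accuracy = (margins > 0).float().mean()    # "reward accuracy"
    return margins.mean(), accuracy
```

A reward accuracy of 0.8018 would then mean the policy assigns the higher length-normalized likelihood to the preferred response in roughly 80% of evaluation pairs.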

Training Details

The training process involved specific hyperparameters:

  • Learning Rate: 8e-07
  • Batch Sizes: A per-device train_batch_size of 2 and eval_batch_size of 2, scaled to a total_train_batch_size of 128 and total_eval_batch_size of 16 via multi-device data parallelism and gradient accumulation (see the arithmetic sketch after this list).
  • Optimizer: AdamW with default betas and epsilon.
  • Epochs: Trained for 1.0 epoch.
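
The reported totals are consistent with an 8-device data-parallel setup using 8 gradient-accumulation steps; note that the device count and accumulation steps are inferred here, since the card does not state them. A minimal arithmetic sketch:

```python
# Hypothetical reconstruction of the effective batch sizes; num_gpus
# and gradient_accumulation_steps are inferred, not stated in the card.
per_device_train_batch_size = 2
per_device_eval_batch_size = 2
num_gpus = 8                      # implied by total_eval = 2 * 8 = 16
gradient_accumulation_steps = 8   # implied by total_train = 2 * 8 * 8 = 128

total_train_batch_size = (per_device_train_batch_size
                          * num_gpus
                          * gradient_accumulation_steps)       # 128
total_eval_batch_size = per_device_eval_batch_size * num_gpus  # 16
print(total_train_batch_size, total_eval_batch_size)
```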

Potential Use Cases

Given its fine-tuning on a feedback-oriented dataset, this model is likely suitable for applications requiring:

  • Response Generation: Generating outputs that align with human preferences or specific quality criteria (a usage sketch follows this list).
  • Preference Learning Tasks: Scenarios where ranking or choosing between different responses is critical.
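
A minimal generation sketch using the Hugging Face transformers library is shown below. It assumes the repository ships a chat template (as Llama 3.1 fine-tunes typically do); the prompt and sampling parameters are illustrative, not taken from the model card.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mlfoundations-dev/simpo-oh-dcft-v3.1-llama-3.1-405b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "Summarize the benefits of preference tuning."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256,
                         temperature=0.7, do_sample=True)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```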

Limitations

The model card indicates that more information is needed regarding its intended uses, limitations, and the specifics of its training and evaluation data. Users should exercise caution and conduct further testing for specific applications.