jackf857/llama-3-8b-base-margin-dpo-hh-helpful-batch-64
jackf857/llama-3-8b-base-margin-dpo-hh-helpful-batch-64 is an 8-billion-parameter Llama 3 base model fine-tuned with Margin DPO on the Anthropic/hh-rlhf dataset. Starting from a previously SFT-tuned Llama 3 variant, it is optimized for helpfulness and is intended for tasks that require helpful, preference-aligned responses, with Llama 3's 8192-token context length.
Model Overview
This model, jackf857/llama-3-8b-base-margin-dpo-hh-helpful-batch-64, is an 8-billion-parameter Llama 3 base model fine-tuned with Margin DPO, a margin-based variant of Direct Preference Optimization (DPO), starting from a Supervised Fine-Tuning (SFT) checkpoint of the Llama 3 base model. Training used the Anthropic/hh-rlhf dataset, a collection of human preference comparisons widely used to align models toward helpful and harmless behavior.
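The exact margin formulation used for this checkpoint is not documented. A common variant subtracts a fixed target margin inside the DPO sigmoid, so the loss only saturates once the chosen completion is preferred over the rejected one by at least that margin. A minimal PyTorch sketch, assuming that formulation (the beta and gamma defaults below are illustrative, not taken from this run):

```python
import torch.nn.functional as F

def margin_dpo_loss(policy_chosen_logps, policy_rejected_logps,
                    ref_chosen_logps, ref_rejected_logps,
                    beta=0.1, gamma=1.0):
    """Sketch of a margin DPO loss. Each argument is a tensor of summed
    per-sequence log-probabilities for the chosen (y_w) or rejected (y_l)
    completion, under the policy or the frozen reference model."""
    # Implicit reward of each completion: beta * log(pi_theta / pi_ref)
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)

    # Gap between chosen and rejected implicit rewards.
    margins = chosen_rewards - rejected_rewards

    # Standard DPO maximizes sigma(margins); the margin variant demands
    # the gap exceed a target gamma before the loss saturates.
    loss = -F.logsigmoid(margins - gamma).mean()
    return loss, margins.mean()
```

Under this reading, the loss margin metric reported below plausibly corresponds to margins.mean(), the average gap between the implicit rewards of the chosen and rejected responses.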
Key Training Details
- Base Model: W-61/llama-3-8b-base-sft-hh-helpful-4xh200
- Fine-tuning Method: Margin DPO
- Dataset: Anthropic/hh-rlhf
- Training Hyperparameters:
  - Learning Rate: 5e-07
  - Total Train Batch Size: 64
  - Number of Epochs: 1
- Evaluation Metrics: final training loss of 0.4046 and a mean reward margin (margin_dpo/loss_margin_mean) of 21.7563, indicating the model learned to clearly separate chosen from rejected responses.
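Assuming the repository ships standard Hugging Face weights and tokenizer files (worth confirming on the repo's files tab), loading follows the usual transformers pattern:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "jackf857/llama-3-8b-base-margin-dpo-hh-helpful-batch-64"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",  # use the checkpoint's stored dtype
    device_map="auto",   # requires accelerate; shards across available GPUs
)
```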
Intended Use Cases
This model is particularly suited to applications where generating helpful, aligned text is crucial. Its fine-tuning on the Anthropic/hh-rlhf dataset suggests a strong emphasis on producing responses that human evaluators would rate as helpful and harmless. Developers can use it for assistant-style tasks such as dialogue, drafting, and question answering, where preference-aligned generation matters.
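Since the checkpoint descends from a base (non-instruct) Llama 3 model trained on hh-rlhf conversations, prompts formatted in that dataset's Human/Assistant dialogue style are a reasonable starting point. A hedged generation sketch (the prompt and sampling settings below are illustrative):

```python
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="jackf857/llama-3-8b-base-margin-dpo-hh-helpful-batch-64",
    torch_dtype="auto",
    device_map="auto",
)

# hh-rlhf formats dialogues as "\n\nHuman: ...\n\nAssistant: ..." turns.
prompt = "\n\nHuman: How do I write a polite follow-up email?\n\nAssistant:"

out = generator(
    prompt,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    return_full_text=False,  # return only the newly generated completion
)
print(out[0]["generated_text"])
```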