W-61/llama-3-8b-base-beta-dpo-hh-helpful-8xh200
W-61/llama-3-8b-base-beta-dpo-hh-helpful-8xh200 is an 8-billion-parameter language model developed by W-61 and fine-tuned from W-61/llama-3-8b-base-sft-hh-helpful-8xh200 using Direct Preference Optimization (DPO) on the Anthropic/hh-rlhf dataset. It is designed to improve helpfulness and alignment while retaining the Llama 3 architecture and its 8192-token context window.
Model Overview
This model is an 8-billion-parameter language model developed by W-61. It is a fine-tuned iteration of W-61/llama-3-8b-base-sft-hh-helpful-8xh200, further optimized on top of that supervised fine-tuned (SFT) checkpoint using Direct Preference Optimization (DPO).
Key Characteristics
- Base Model: Built on the Llama 3 8B base architecture, via the SFT checkpoint W-61/llama-3-8b-base-sft-hh-helpful-8xh200.
- Fine-tuning: Utilizes Direct Preference Optimization (DPO) for alignment.
- Dataset: Preference-tuned on the Anthropic/hh-rlhf dataset; the "hh-helpful" in the model name suggests a focus on the helpfulness portion of the data.
- Context Length: Supports a context window of 8192 tokens (see the loading sketch after this list).
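Because this is a standard Llama 3 checkpoint, it should load with the usual Hugging Face transformers AutoClasses. A minimal sketch (the repository id comes from this card; dtype and device placement are assumptions):

```python
# Minimal loading sketch; dtype and device placement are assumptions,
# only the repository id and the 8192-token context come from this card.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "W-61/llama-3-8b-base-beta-dpo-hh-helpful-8xh200"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # half precision is typical for 8B inference
    device_map="auto",           # requires accelerate to be installed
)

# The Llama 3 architecture used here has an 8192-token context window.
print(model.config.max_position_embeddings)  # expected: 8192
```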
Training Details
The model was trained for 1 epoch at a learning rate of 5e-07 with an effective batch size of 128 across 8 GPUs. Reported evaluation metrics include a final loss of 0.6427 and a mean logged beta_dpo/gap of 20.0887; since this gap likely measures the reward margin between chosen and rejected responses, the positive value suggests the DPO phase improved preference alignment.
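The training script itself is not reproduced here, but a DPO run with these hyperparameters can be sketched with the trl library. Everything in the sketch below that is not a hyperparameter from this card (the beta value, the prompt-splitting step, the per-device batch split) is an assumption, and argument names vary across trl versions:

```python
# Hypothetical reconstruction of the DPO phase with trl. The learning rate,
# epoch count, and effective batch size (128 across 8 GPUs) come from this
# card; beta and the preprocessing are assumptions.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

sft_model_id = "W-61/llama-3-8b-base-sft-hh-helpful-8xh200"
model = AutoModelForCausalLM.from_pretrained(sft_model_id)
tokenizer = AutoTokenizer.from_pretrained(sft_model_id)

# hh-rlhf ships full transcripts in "chosen"/"rejected"; DPOTrainer expects
# separate "prompt", "chosen", and "rejected" fields, so split each pair at
# the final assistant turn (a simplification of whatever the authors did).
def split_prompt(example):
    marker = "\n\nAssistant:"
    cut = example["chosen"].rfind(marker) + len(marker)
    rcut = example["rejected"].rfind(marker) + len(marker)
    return {
        "prompt": example["chosen"][:cut],
        "chosen": example["chosen"][cut:],
        "rejected": example["rejected"][rcut:],
    }

dataset = load_dataset("Anthropic/hh-rlhf", data_dir="helpful-base", split="train")
dataset = dataset.map(split_prompt)

config = DPOConfig(
    output_dir="llama-3-8b-base-beta-dpo-hh-helpful",
    learning_rate=5e-7,             # from this card
    num_train_epochs=1,             # from this card
    per_device_train_batch_size=4,  # 4 x 4 accumulation x 8 GPUs = 128
    gradient_accumulation_steps=4,
    beta=0.1,                       # assumption; the actual beta is not stated
    bf16=True,
)

trainer = DPOTrainer(
    model=model,                    # a frozen reference copy is created internally
    args=config,
    train_dataset=dataset,
    processing_class=tokenizer,     # "tokenizer=" in older trl versions
)
trainer.train()
```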
Potential Use Cases
Given its DPO fine-tuning on a helpfulness dataset, this model is likely suitable for applications requiring:
- Helpful and aligned responses: Generating user-friendly and constructive text.
- General-purpose conversational AI: Where safety and helpfulness are priorities.
- Further research into DPO and alignment techniques: As a base for experimental work.
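For the conversational cases above, a brief generation example. The Human/Assistant prompt format mirrors the hh-rlhf transcripts and is an assumption, since this card does not specify a chat template:

```python
# Hypothetical usage example; the "Human:/Assistant:" format mirrors the
# hh-rlhf training data and is an assumption, not a documented template.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "W-61/llama-3-8b-base-beta-dpo-hh-helpful-8xh200"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = "\n\nHuman: How do I politely decline a meeting invitation?\n\nAssistant:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.7,
    pad_token_id=tokenizer.eos_token_id,  # Llama tokenizers define no pad token
)
# Print only the newly generated assistant turn.
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```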