W-61/llama-3-8b-base-epsilon-dpo-hh-helpful-8xh200
The W-61/llama-3-8b-base-epsilon-dpo-hh-helpful-8xh200 is an 8-billion-parameter language model, fine-tuned by W-61 from a Llama-3-8B base model. It was optimized using Direct Preference Optimization (DPO) on the Anthropic/hh-rlhf dataset, with a focus on helpfulness. The model is designed for applications requiring helpful and aligned text generation, and inherits the 8192-token context length of its Llama-3-8B foundation.
Overview
This model, llama-3-8b-base-epsilon-dpo-hh-helpful-8xh200, is an 8-billion-parameter language model developed by W-61. It is a fine-tuned variant of the W-61/llama-3-8b-base-sft-hh-helpful-8xh200 model, optimized using Direct Preference Optimization (DPO).
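For reference, below is a minimal inference sketch using the Hugging Face transformers library. It assumes the model loads with the standard AutoModel API under the repository id above; the prompt format follows the "Human:/Assistant:" convention of the hh-rlhf dataset, and the dtype and generation parameters are illustrative assumptions, not tuned defaults.

```python
# Minimal inference sketch; assumes the standard transformers API and that
# the model is available under the repository id shown in this card.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "W-61/llama-3-8b-base-epsilon-dpo-hh-helpful-8xh200"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # assumed dtype; halves memory vs. fp32
    device_map="auto",           # place the 8B weights on available GPUs
)

# Assumed prompt format, mirroring the Human/Assistant turns of hh-rlhf.
prompt = "Human: How do I brew a good cup of coffee?\n\nAssistant:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7)

# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```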
Key Capabilities
- Preference Alignment: Fine-tuned on the Anthropic/hh-rlhf dataset, indicating an emphasis on generating helpful and aligned responses.
- DPO Training: Utilizes Direct Preference Optimization, a method for aligning models with human preferences without a complex reinforcement learning setup (a minimal loss sketch follows this list).
- Performance Metrics: Achieved a rewards accuracy of 0.6402 and a chosen-rejected reward margin of 0.1454 on its evaluation set, demonstrating its ability to differentiate preferred responses.
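To make the DPO objective and the reported metrics concrete, here is a minimal sketch of the DPO loss over a batch of chosen/rejected log-probabilities. The function name, variable names, and the beta value are illustrative assumptions, not taken from this model's actual training configuration.

```python
# Sketch of the DPO loss: push the policy's log-probability ratio (vs. a
# frozen reference model) higher on chosen responses than on rejected ones.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Implicit rewards are beta-scaled log-ratios against the reference model.
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    margin = beta * (chosen_ratio - rejected_ratio)
    loss = -F.logsigmoid(margin).mean()
    # "Rewards accuracy" as reported above: the fraction of pairs where the
    # implicit chosen reward exceeds the implicit rejected reward.
    accuracy = (chosen_ratio > rejected_ratio).float().mean()
    return loss, accuracy
```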
Training Details
The model was trained for a single epoch with a learning rate of 5e-07 and a total batch size of 128 across 8 GPUs, using the AdamW optimizer and a cosine learning-rate scheduler with a 0.1 warmup ratio.
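A hypothetical reconstruction of these settings with transformers.TrainingArguments is sketched below. The per-device batch size and gradient accumulation steps are assumptions chosen so that they multiply out to the stated total of 128 across 8 GPUs; the output directory and bf16 flag are likewise assumed.

```python
# Hypothetical reconstruction of the reported hyperparameters; only the
# epoch count, learning rate, total batch size, optimizer, scheduler, and
# warmup ratio come from this card.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="llama-3-8b-base-epsilon-dpo-hh-helpful-8xh200",
    num_train_epochs=1,
    learning_rate=5e-7,
    per_device_train_batch_size=4,  # assumption: 4 x 8 GPUs x 4 accum = 128
    gradient_accumulation_steps=4,
    optim="adamw_torch",
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    bf16=True,  # assumption for H200-class hardware
)
```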
Good For
This model is suited to use cases where helpful, aligned, preference-optimized text generation matters, combining DPO fine-tuning with a strong Llama-3-8B base.