W-61/llama-3-8b-base-new-dpo-hh-helpful-4xh200-batch-64-q_t-0.45-s_star-0.4-eta-0.01
W-61/llama-3-8b-base-new-dpo-hh-helpful-4xh200-batch-64-q_t-0.45-s_star-0.4-eta-0.01 is an 8-billion-parameter language model developed by W-61 and fine-tuned from a Llama 3 base model. It was further optimized with Direct Preference Optimization (DPO) on the Anthropic/hh-rlhf dataset and is intended for generating helpful, aligned responses, with an 8192-token context window for conversational and text-generation tasks.
Overview
Developed by W-61, this is an 8-billion-parameter language model based on the Llama 3 architecture. It is a fine-tuned iteration of W-61/llama-3-8b-base-sft-hh-helpful-4xh200, optimized with Direct Preference Optimization (DPO).
Key Characteristics
- Base Model: Llama 3 8B.
- Fine-tuning: Utilizes Direct Preference Optimization (DPO) for enhanced alignment.
- Training Data: Fine-tuned on the Anthropic/hh-rlhf dataset, which focuses on helpfulness and harmlessness.
- Context Length: Supports an 8192-token context window.
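
The checkpoint should load with the standard Transformers API. The minimal sketch below assumes the weights are hosted on the Hub under the repository name above; the dtype and device placement are illustrative choices, not values taken from the README.

```python
# Minimal loading sketch (illustrative, not an official example).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "W-61/llama-3-8b-base-new-dpo-hh-helpful-4xh200-batch-64-q_t-0.45-s_star-0.4-eta-0.01"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # assumption: bf16 is typical for Llama 3 8B, not stated in the README
    device_map="auto",
)
```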
Training Details
The model was trained with the following hyperparameters:
- Learning Rate: 5e-07
- Batch Size: A total training batch size of 64 (8 per device across 4 GPUs with 2 gradient accumulation steps).
- Optimizer: ADAMW_TORCH with the default betas (0.9, 0.999) and epsilon (1e-08).
- Scheduler: Cosine learning rate scheduler with a 0.1 warmup ratio.
- Epochs: Trained for 1 epoch.
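
For reference, a minimal sketch of this configuration using TRL's DPOTrainer follows. This is not the authors' training script: the q_t, s_star, and eta values in the model name hint at a modified objective that stock TRL does not expose, and the exact API varies by TRL version.

```python
# Hypothetical reproduction sketch with TRL's DPOTrainer (assumptions throughout).
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

# SFT checkpoint named in the Overview above.
sft_model_id = "W-61/llama-3-8b-base-sft-hh-helpful-4xh200"

model = AutoModelForCausalLM.from_pretrained(sft_model_id)
tokenizer = AutoTokenizer.from_pretrained(sft_model_id)

# hh-rlhf provides "chosen"/"rejected" transcripts; recent TRL versions can
# extract the shared prompt prefix automatically, while older releases need
# explicit "prompt"/"chosen"/"rejected" columns.
train_dataset = load_dataset("Anthropic/hh-rlhf", split="train")

args = DPOConfig(
    output_dir="llama-3-8b-dpo-hh",
    per_device_train_batch_size=8,   # 8 per device x 4 GPUs x 2 accumulation = 64 effective
    gradient_accumulation_steps=2,
    learning_rate=5e-7,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    num_train_epochs=1,
    optim="adamw_torch",
)

trainer = DPOTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    processing_class=tokenizer,  # recent TRL; older releases use tokenizer=...
)
trainer.train()
```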
Intended Use Cases
The README does not enumerate specific intended uses, but DPO fine-tuning on the Anthropic/hh-rlhf dataset points to helpful, safe, and aligned text generation, making the model a reasonable fit for conversational assistants and instruction-following applications.
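
Continuing from the loading sketch above, here is an illustrative generation call. The README does not specify a prompt template; the "\n\nHuman:" / "\n\nAssistant:" turn markers below follow the Anthropic/hh-rlhf transcript format and are an assumption.

```python
# Illustrative inference sketch; prompt format is an assumption based on hh-rlhf.
prompt = "\n\nHuman: How do I keep basil fresh for longer?\n\nAssistant:"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7)

# Decode only the newly generated tokens, skipping the echoed prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```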