W-61/llama-3-8b-base-new-dpo-hh-helpful-4xh200-batch-64-s_star-0.4-eta-0.1-q_t-0.5
W-61/llama-3-8b-base-new-dpo-hh-helpful-4xh200-batch-64-s_star-0.4-eta-0.1-q_t-0.5 is an 8 billion parameter Llama 3 base model fine-tuned by W-61. This model is specifically optimized using Direct Preference Optimization (DPO) on the Anthropic/hh-rlhf dataset, enhancing its helpfulness and alignment. With an 8192-token context length, it is designed for generating helpful and aligned responses in conversational AI applications.
Model Overview
This model, llama-3-8b-base-new-dpo-hh-helpful-4xh200-batch-64-s_star-0.4-eta-0.1-q_t-0.5, is an 8 billion parameter variant of the Llama 3 architecture. Developed by W-61, it is a fine-tuned iteration of the W-61/llama-3-8b-base-sft-hh-helpful-4xh200 model.
Key Capabilities
- Preference Alignment: The model has undergone Direct Preference Optimization (DPO) using the Anthropic/hh-rlhf dataset. This training method aims to align the model's outputs more closely with human preferences, particularly for helpfulness (see the loss sketch after this list).
- Base Model Enhancement: It builds upon a supervised fine-tuned (SFT) Llama 3 base, further refining its response generation capabilities.
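For readers unfamiliar with DPO, the following is a minimal sketch of the core DPO loss in PyTorch. It is illustrative only, not W-61's training code; the beta value and the toy log-probability inputs are assumptions chosen for demonstration.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Direct Preference Optimization loss.

    Each argument is a tensor of per-sequence log-probabilities
    (summed over tokens) for the chosen/rejected completions under
    the policy being trained and the frozen reference (SFT) model.
    beta controls how far the policy may drift from the reference.
    """
    # Implicit reward: scaled log-prob ratio of policy vs. reference.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between chosen and rejected rewards.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy example with a batch of two preference pairs.
loss = dpo_loss(torch.tensor([-12.0, -9.5]), torch.tensor([-14.0, -11.0]),
                torch.tensor([-12.5, -10.0]), torch.tensor([-13.5, -10.5]))
print(loss)
```

Intuitively, the loss pushes the policy to assign relatively higher probability to the preferred (chosen) completion than the reference model does, while the rejected completion is pushed the other way.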
Training Details
The fine-tuning process used the following hyperparameters (a sketch mapping them to code follows the list):
- Learning Rate: 5e-07
- Batch Size: A train_batch_size of 8, with a total_train_batch_size of 64 due to gradient accumulation.
- Optimizer: AdamW with default betas and epsilon.
- Scheduler: Cosine learning rate scheduler with a 0.1 warmup ratio.
- Epochs: Trained for 1 epoch.
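As a hedged illustration, these hyperparameters might map to Hugging Face transformers TrainingArguments as sketched below. The gradient_accumulation_steps value is an inference from the 4xH200 setup implied by the model name (4 GPUs x 8 per device x 2 accumulation steps = 64); it, along with output_dir, optim, and bf16, is an assumption rather than a documented setting.

```python
from transformers import TrainingArguments

# Hypothetical reconstruction of the reported hyperparameters.
# Only the values listed above come from the model card.
training_args = TrainingArguments(
    output_dir="llama-3-8b-dpo-hh-helpful",  # hypothetical path
    learning_rate=5e-7,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,           # 4 GPUs x 8 x 2 = 64 total (assumed split)
    num_train_epochs=1,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    optim="adamw_torch",                     # AdamW with default betas/epsilon
    bf16=True,                               # assumed for H200 hardware
)
```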
Intended Use Cases
Given its DPO fine-tuning on a helpfulness dataset, this model is well-suited for applications requiring the following (a usage sketch follows the list):
- Generating helpful and aligned text.
- Conversational AI where user preference and safety are important.
- Tasks benefiting from models trained to reduce harmful or unhelpful outputs.
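The snippet below is a minimal inference sketch using the transformers library. The Human/Assistant prompt format follows the Anthropic/hh-rlhf convention; whether this checkpoint expects exactly that template, and the generation settings shown, are assumptions.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "W-61/llama-3-8b-base-new-dpo-hh-helpful-4xh200-batch-64-s_star-0.4-eta-0.1-q_t-0.5"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# hh-rlhf style prompt; the exact template expected by this checkpoint is an assumption.
prompt = "\n\nHuman: How do I write a polite follow-up email?\n\nAssistant:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=256, temperature=0.7, do_sample=True)
# Print only the newly generated tokens, not the prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```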