# W-61/llama-3-8b-base-new-dpo-hh-helpful-4xh200-batch-64-s_star-0.4-eta-0.1-q_t-0.43
W-61/llama-3-8b-base-new-dpo-hh-helpful-4xh200-batch-64-s_star-0.4-eta-0.1-q_t-0.43 is an 8-billion-parameter language model fine-tuned by W-61. It is a Direct Preference Optimization (DPO) variant derived from the Llama 3 8B base model, trained on the Anthropic/hh-rlhf dataset to improve helpfulness. It is designed to generate helpful, aligned responses while building on the capabilities of its base model.
## Model Overview
This model, W-61/llama-3-8b-base-new-dpo-hh-helpful-4xh200-batch-64-s_star-0.4-eta-0.1-q_t-0.43, is an 8 billion parameter language model developed by W-61. It is a fine-tuned version of the W-61/llama-3-8b-base-sft-hh-helpful-4xh200 model, further optimized using Direct Preference Optimization (DPO).
## Key Characteristics
- Base Model: Built upon the Llama 3 8B architecture.
- Fine-tuning: Underwent DPO training on the Anthropic/hh-rlhf dataset (the standard DPO objective is shown below).
- Optimization Goal: Primarily aimed at enhancing helpfulness and alignment in its responses.
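DPO requires no explicit reward model: the policy is optimized directly on preference pairs against a frozen reference model. As a refresher (standard notation, not values specific to this run), the DPO loss from Rafailov et al. (2023) is:

$$
\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right) \right]
$$

where $y_w$ and $y_l$ are the chosen and rejected responses for prompt $x$, and $\beta$ controls how far the policy may drift from the reference (here, the SFT checkpoint); the specific $\beta$ used for this model is not documented above.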
## Training Details
The model was trained with the following hyperparameters (an equivalent configuration is sketched after the list):
- Learning Rate: 5e-07
- Batch Size: 8 per device (train and eval) with 2 gradient accumulation steps across 4 GPUs, for an effective train batch size of 64.
- Optimizer: AdamW with betas=(0.9, 0.999) and epsilon=1e-08.
- Scheduler: Cosine learning rate scheduler with a 0.1 warmup ratio.
- Epochs: Trained for 1 epoch.
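The sketch below reconstructs these hyperparameters as a `trl` DPO run. This is an assumption for illustration, not the actual training script: whether this model was trained with `trl` is not stated, the exact `trl` API varies by version (e.g. `processing_class` vs. `tokenizer`), the `output_dir` name is invented, and the hh-rlhf preprocessing into prompt/chosen/rejected fields is elided.

```python
# Hypothetical reconstruction of the documented hyperparameters with trl.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base_id = "W-61/llama-3-8b-base-sft-hh-helpful-4xh200"      # SFT starting point
model = AutoModelForCausalLM.from_pretrained(base_id)        # policy to optimize
ref_model = AutoModelForCausalLM.from_pretrained(base_id)    # frozen reference
tokenizer = AutoTokenizer.from_pretrained(base_id)

# Preference data; converting each record into the prompt/chosen/rejected
# fields that DPOTrainer expects is omitted for brevity.
dataset = load_dataset("Anthropic/hh-rlhf", split="train")

config = DPOConfig(
    output_dir="llama-3-8b-dpo-hh-helpful",  # hypothetical name
    learning_rate=5e-7,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=2,  # 8 per device x 2 steps x 4 GPUs = 64
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    num_train_epochs=1,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
)

trainer = DPOTrainer(model=model, ref_model=ref_model, args=config,
                     train_dataset=dataset, processing_class=tokenizer)
trainer.train()
```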
## Intended Use
This model is suitable for applications requiring helpful and aligned text generation, leveraging its DPO fine-tuning on human preference data.
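For quick experimentation, the snippet below shows one way to load and query the model with the `transformers` library. The `Human:`/`Assistant:` prompt format mirrors the hh-rlhf data and is an assumption; adjust it to match your own prompting setup.

```python
# Minimal generation sketch; assumes a standard causal LM checkpoint
# and a GPU with enough memory for an 8B model in bfloat16.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "W-61/llama-3-8b-base-new-dpo-hh-helpful-4xh200-batch-64-s_star-0.4-eta-0.1-q_t-0.43"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Prompt format assumed from the hh-rlhf training data.
prompt = "Human: How do I bake a loaf of sourdough bread?\n\nAssistant:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7)

# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```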