W-61/llama-3-8b-base-new-dpo-hh-helpful-4xh200-batch-64-q_t-0.45-s_star-0.4-eta-8
W-61/llama-3-8b-base-new-dpo-hh-helpful-4xh200-batch-64-q_t-0.45-s_star-0.4-eta-8 is an 8-billion-parameter language model fine-tuned by W-61 on the Llama 3 architecture. It was fine-tuned with Direct Preference Optimization (DPO) on the Anthropic/hh-rlhf dataset to improve helpfulness and alignment with human preferences. The model targets conversational AI applications that require helpful, aligned responses and supports a context length of 8,192 tokens.
Model Overview
This model, llama-3-8b-base-new-dpo-hh-helpful-4xh200-batch-64-q_t-0.45-s_star-0.4-eta-8, is an 8-billion-parameter language model developed by W-61, fine-tuned from the W-61/llama-3-8b-base-sft-hh-helpful-4xh200 SFT checkpoint.
Key Characteristics
- Base Model: Llama 3 8B architecture.
- Fine-tuning Method: Utilizes Direct Preference Optimization (DPO).
- Training Data: Fine-tuned on the Anthropic/hh-rlhf dataset, a collection of human preference comparisons gathered to improve helpfulness and reduce harmfulness.
- Context Length: Supports a context window of 8,192 tokens (a loading sketch follows this list).
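A minimal loading and generation sketch with Hugging Face transformers is shown below. The repository ID comes from this card; the generation settings and the Human/Assistant prompt format (carried over from the hh-rlhf data the SFT base was trained on) are illustrative assumptions, not published defaults.

```python
# Minimal inference sketch. The repository ID comes from this card;
# dtype, generation settings, and prompt format are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "W-61/llama-3-8b-base-new-dpo-hh-helpful-4xh200-batch-64-q_t-0.45-s_star-0.4-eta-8"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # bf16 is a common choice for Llama 3 inference
    device_map="auto",
)

# hh-rlhf-style transcript prompt (assumed; no chat template is documented here)
prompt = "\n\nHuman: How do I brew a good cup of pour-over coffee?\n\nAssistant:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7)

# Decode only the newly generated tokens
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```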
Training Details
The model was trained for a single epoch with a learning rate of 5e-7 and an effective batch size of 64 across 4 GPUs (H200s, per the model name). The optimizer was AdamW with a cosine learning-rate schedule and a warmup ratio of 0.1.
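The exact training script is not published, and the q_t, s_star, and eta values in the model name hint at a modified DPO objective that is not documented here. As a rough sketch, the card's stated hyperparameters could be expressed with standard DPO in the TRL library (v0.12+ API) as follows; the dataset subset, per-device batch split, and precision setting are assumptions:

```python
# Sketch of a DPO setup matching the hyperparameters on this card, using
# standard TRL DPO (the "new-dpo" variant implied by the model name is not
# documented). Dataset subset, batch split, and bf16 are assumptions.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base_id = "W-61/llama-3-8b-base-sft-hh-helpful-4xh200"  # SFT checkpoint named on this card

model = AutoModelForCausalLM.from_pretrained(base_id)
tokenizer = AutoTokenizer.from_pretrained(base_id)

# hh-rlhf provides chosen/rejected transcripts with an implicit shared prompt,
# which TRL's DPOTrainer handles; "helpful-base" subset is an assumption.
dataset = load_dataset("Anthropic/hh-rlhf", data_dir="helpful-base", split="train")

config = DPOConfig(
    output_dir="llama-3-8b-dpo-hh-helpful",
    num_train_epochs=1,              # single epoch, per this card
    learning_rate=5e-7,              # per this card
    per_device_train_batch_size=16,  # 16 x 4 GPUs = 64 total (illustrative split)
    lr_scheduler_type="cosine",      # cosine schedule, per this card
    warmup_ratio=0.1,                # per this card
    optim="adamw_torch",             # AdamW, per this card
    bf16=True,                       # assumption: mixed precision on H200s
)

trainer = DPOTrainer(
    model=model,
    args=config,
    train_dataset=dataset,
    processing_class=tokenizer,
)
trainer.train()
```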
Intended Use Cases
This model is primarily intended for applications where generating helpful, harmless, and aligned responses is crucial. Its DPO fine-tuning on a human feedback dataset suggests suitability for:
- Chatbots and conversational AI systems (a prompt-format sketch follows this list).
- Assistants requiring helpful and preference-aligned outputs.
- Tasks benefiting from improved instruction following and safety.
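For chatbot use, the hh-rlhf data follows a plain-text transcript convention with alternating Human/Assistant turns. Since this card documents no chat template, a small hypothetical helper for building multi-turn prompts in that convention might look like this:

```python
# Hypothetical helper for multi-turn prompts in the hh-rlhf transcript
# convention ("\n\nHuman: ...\n\nAssistant: ..."). The format is assumed
# from the training data; no chat template is documented on this card.
def build_prompt(turns: list[tuple[str, str]], next_user_msg: str) -> str:
    """turns: (human_msg, assistant_msg) pairs from earlier in the conversation."""
    prompt = ""
    for human, assistant in turns:
        prompt += f"\n\nHuman: {human}\n\nAssistant: {assistant}"
    prompt += f"\n\nHuman: {next_user_msg}\n\nAssistant:"
    return prompt

history = [
    ("What's a good beginner espresso machine?",
     "A popular entry-level option is a single-boiler machine with a PID controller."),
]
print(build_prompt(history, "How should I dial in the grind?"))
```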