W-61/llama-3-8b-base-new-dpo-hh-helpful-4xh200-batch-64-q_t-0.45-eta-0.1-s_star-0.6-20260428-045924
The W-61/llama-3-8b-base-new-dpo-hh-helpful-4xh200-batch-64-q_t-0.45-eta-0.1-s_star-0.6-20260428-045924 model is an 8-billion-parameter Llama 3 base model, fine-tuned by W-61 using Direct Preference Optimization (DPO) on the Anthropic/hh-rlhf dataset. The fine-tuning targets helpfulness and alignment, making the model suitable for conversational AI and instruction-following tasks. It supports a context length of 8192 tokens, enough to keep longer interactions coherent and contextually relevant.
Overview
This model, developed by W-61, is an 8-billion-parameter Llama 3 base model fine-tuned with Direct Preference Optimization (DPO). Training used the Anthropic/hh-rlhf dataset, a collection of human preference comparisons emphasizing helpfulness and harmlessness, steering the model's outputs toward responses people actually prefer.
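A minimal loading sketch with Hugging Face transformers is below. It assumes the checkpoint is published on the Hub under the repository id in the title; the hosting location and dtype choice are assumptions, not facts from this card.

```python
# Minimal loading sketch (assumes the checkpoint is available on the Hugging Face
# Hub under this repository id; substitute a local path if it is hosted elsewhere).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "W-61/llama-3-8b-base-new-dpo-hh-helpful-4xh200-batch-64-q_t-0.45-eta-0.1-s_star-0.6-20260428-045924"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # 8B weights in bf16 fit on a single modern GPU
    device_map="auto",
)
```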
Key Capabilities
- Preference Alignment: Fine-tuned with DPO on the Anthropic/hh-rlhf dataset, which pairs preferred and rejected responses, suggesting improved helpfulness and reduced harmfulness (a sketch of the DPO loss follows this list).
- Llama 3 Architecture: Benefits from the foundational capabilities of the Llama 3 8B base model.
- Context Handling: Supports a context length of 8192 tokens, enabling it to process and generate longer, more detailed interactions.
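For context, DPO trains directly on preference pairs: it increases the policy's log-probability margin for the chosen response over the rejected one, measured relative to a frozen reference model. The sketch below is a generic illustration of that loss, not W-61's training code, and the beta value is a placeholder.

```python
# Generic DPO loss sketch (illustrative; not W-61's actual training code).
# pi_*  : summed log-probs of a response under the policy being trained
# ref_* : summed log-probs of the same response under the frozen reference model
import torch
import torch.nn.functional as F

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    # Implicit rewards: how much more the policy favors each response
    # than the reference model does, scaled by beta.
    chosen_reward = beta * (pi_chosen - ref_chosen)
    rejected_reward = beta * (pi_rejected - ref_rejected)
    # Logistic loss pushes the chosen reward above the rejected reward.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()
```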
Training Details
The model was trained with a learning rate of 5e-07 and a total batch size of 64 using the AdamW optimizer, running on 4 GPUs with 2 gradient-accumulation steps (implying a per-device batch size of 8, since 8 × 4 × 2 = 64) for 1 epoch. The low learning rate and single epoch are typical of preference fine-tuning, which makes small, targeted adjustments to an already capable base model.
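Below is a hedged reconstruction of this configuration using TRL's DPOTrainer. The card does not name the training framework, so TRL, the DPO beta, and the dataset preprocessing are assumptions; only the hyperparameters mirror what is reported above.

```python
# Hypothetical reconstruction of the training setup with TRL's DPOTrainer.
# Launch across 4 GPUs (e.g. with torchrun/accelerate) to reproduce the
# reported total batch size of 64.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base = "meta-llama/Meta-Llama-3-8B"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

config = DPOConfig(
    output_dir="llama-3-8b-dpo-hh-helpful",
    learning_rate=5e-7,             # as reported
    per_device_train_batch_size=8,  # 8 x 4 GPUs x 2 accum steps = 64 total
    gradient_accumulation_steps=2,  # as reported
    num_train_epochs=1,             # as reported
    optim="adamw_torch",            # AdamW, as reported
    beta=0.1,                       # placeholder; the actual DPO beta is not stated on this card
    bf16=True,
)

trainer = DPOTrainer(
    model=model,  # a frozen reference copy is created automatically when none is passed
    args=config,
    # hh-rlhf stores full chosen/rejected dialogues; recent TRL versions extract
    # the shared prompt automatically (older ones need explicit preprocessing).
    train_dataset=load_dataset("Anthropic/hh-rlhf", split="train"),
    processing_class=tokenizer,  # older TRL versions take tokenizer= instead
)
trainer.train()
```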
Good for
- Developing conversational AI agents that require helpful and aligned responses.
- Applications where human preference alignment is a critical factor.
- Instruction-following tasks where the model needs to adhere to specific guidelines.
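Because hh-rlhf dialogues use a plain "Human:"/"Assistant:" turn format, prompting the model in the same style is likely the best fit for these use cases. The snippet below continues the loading sketch from the Overview; the sampling settings are illustrative, not recommendations from this card.

```python
# Hedged generation sketch, reusing `tokenizer` and `model` from the loading
# example above. hh-rlhf uses "Human:"/"Assistant:" turns, so prompting in that
# style should match the fine-tuning distribution.
prompt = "\n\nHuman: Explain what DPO fine-tuning does in two sentences.\n\nAssistant:"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(
    **inputs,
    max_new_tokens=200,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```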