W-61/llama-3-8b-base-new-dpo-hh-harmless-4xh200-batch-64-q_t-0.45-eta-0.1-s_star-0.6-20260428-045924
W-61/llama-3-8b-base-new-dpo-hh-harmless-4xh200-batch-64-q_t-0.45-eta-0.1-s_star-0.6-20260428-045924 is an 8 billion parameter language model developed by W-61. It is a variant of the Llama 3 8B base model, fine-tuned with Direct Preference Optimization (DPO) on the Anthropic/hh-rlhf dataset. The model targets harmless and helpful conversational use and supports an 8192-token context window.
Overview
This model, developed by W-61, is an 8 billion parameter language model based on the Llama 3 architecture. It was fine-tuned with Direct Preference Optimization (DPO) on the Anthropic/hh-rlhf dataset, a preference dataset built around helpful and harmless assistant responses. Training used a learning rate of 5e-07, an effective batch size of 64, and a cosine learning-rate schedule over a single epoch.
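For orientation, the preference data referenced above can be inspected directly. The snippet below is a minimal sketch using the Hugging Face datasets library; the `chosen`/`rejected` column names are those of the public Anthropic/hh-rlhf release.

```python
from datasets import load_dataset

# Load the preference dataset this model was tuned on.
# Each example pairs a preferred ("chosen") and dispreferred ("rejected")
# Human/Assistant transcript.
hh = load_dataset("Anthropic/hh-rlhf", split="train")

example = hh[0]
print(example["chosen"][:200])    # preferred assistant transcript
print(example["rejected"][:200])  # dispreferred assistant transcript
```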
Key Capabilities
- Harmless and Helpful Responses: Optimized through DPO on the Anthropic/hh-rlhf dataset to generate responses that are aligned with safety and helpfulness guidelines.
- Llama 3 Base: Leverages the foundational capabilities of the Llama 3 8B base model.
- Context Length: Supports an 8192-token context window, enabling more extensive and coherent conversations (see the inference sketch after this list).
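The sketch below shows one way to run the model for conversation. It assumes the checkpoint is published under the repository id in the title and that the hh-rlhf-style `Human:`/`Assistant:` prompt format applies; adjust both to match your setup.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Repo id taken from the model name above; adjust if hosted elsewhere.
model_id = "W-61/llama-3-8b-base-new-dpo-hh-harmless-4xh200-batch-64-q_t-0.45-eta-0.1-s_star-0.6-20260428-045924"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# hh-rlhf-style prompt format: "\n\nHuman: ...\n\nAssistant:"
prompt = "\n\nHuman: How do I politely decline a meeting invitation?\n\nAssistant:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
# Decode only the newly generated tokens, not the prompt.
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```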
Training Details
The model was trained with a learning rate of 5e-07, a per-device batch size of 8 across 4 GPUs, and gradient accumulation over 2 steps, giving an effective batch size of 8 × 4 × 2 = 64. The optimizer was AdamW (adamw_torch), and the learning rate followed a cosine decay schedule with a warmup ratio of 0.1 over a single epoch. Training used Transformers 4.51.0, PyTorch 2.3.1+cu121, Datasets 2.21.0, and Tokenizers 0.21.4.
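For reference, these hyperparameters map onto a standard DPO setup. The sketch below uses TRL's DPOConfig/DPOTrainer as one plausible reconstruction; the card does not state which training framework was used, the hh-rlhf data would first need to be split into prompt/chosen/rejected fields, and the q_t/eta/s_star values in the model name suggest a modified DPO objective not covered here, so treat this as illustrative only.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "meta-llama/Meta-Llama-3-8B"  # Llama 3 8B base, per the card
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# hh-rlhf needs preprocessing into prompt/chosen/rejected fields for
# DPOTrainer; that step is elided here for brevity.
train_dataset = load_dataset("Anthropic/hh-rlhf", split="train")

config = DPOConfig(
    output_dir="llama3-8b-dpo-hh-harmless",
    learning_rate=5e-07,
    per_device_train_batch_size=8,   # x 4 GPUs x 2 accumulation steps = 64
    gradient_accumulation_steps=2,
    num_train_epochs=1,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    optim="adamw_torch",
)

trainer = DPOTrainer(
    model=model,
    args=config,
    train_dataset=train_dataset,
    processing_class=tokenizer,
)
trainer.train()
```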