W-61/llama-3-8b-base-new-dpo-hh-helpful-4xh200-batch-64-q_t-0.45-s_star-0.4-eta-0.05
W-61/llama-3-8b-base-new-dpo-hh-helpful-4xh200-batch-64-q_t-0.45-s_star-0.4-eta-0.05 is an 8-billion-parameter language model fine-tuned by W-61. It is based on the Llama 3 architecture and was further aligned using Direct Preference Optimization (DPO) on the Anthropic/hh-rlhf dataset. The model is designed for helpful conversational AI applications and supports an 8192-token context length for extended interactions.
Model Overview
This model, llama-3-8b-base-new-dpo-hh-helpful-4xh200-batch-64-q_t-0.45-s_star-0.4-eta-0.05, is an 8 billion parameter language model developed by W-61. It is a fine-tuned variant of the W-61/llama-3-8b-base-sft-hh-helpful-4xh200 base model, specifically enhanced through Direct Preference Optimization (DPO).
Key Characteristics
- Base Architecture: Built upon the Llama 3 family of models.
- Parameter Count: Features 8 billion parameters, offering a balance between performance and computational efficiency.
- Context Length: Supports an 8192 token context window, enabling the processing of longer inputs and generating more coherent, extended responses.
- Fine-tuning Method: Aligned with Direct Preference Optimization (DPO), trained on the Anthropic/hh-rlhf dataset to improve the model's helpfulness and adherence to human preferences (the standard DPO objective is sketched after this list).
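For reference, DPO optimizes the policy directly on preference pairs, with no separately trained reward model. The suffix parameters in the model name (q_t-0.45, s_star-0.4, eta-0.05) appear to configure a custom DPO variant that is not documented here; the sketch below shows only the standard DPO loss from Rafailov et al. (2023):

```latex
% Standard DPO objective: \beta controls the strength of the implicit KL penalty,
% (x, y_w, y_l) is a prompt with chosen and rejected responses drawn from the
% preference dataset \mathcal{D}, and \pi_{\mathrm{ref}} is the frozen SFT reference policy.
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) =
  -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
  \left[ \log \sigma\!\left(
    \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
    - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
  \right) \right]
```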
Training Details
The model was trained for a single epoch with a learning rate of 5e-07 and a total batch size of 64 across 4 GPUs, using the AdamW optimizer with a cosine learning rate scheduler and a warmup ratio of 0.1. Training used Transformers 4.51.0, PyTorch 2.3.1+cu121, Datasets 2.21.0, and Tokenizers 0.21.4.
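A minimal sketch of how this configuration might be reproduced with TRL's DPOTrainer is shown below. This is an illustrative reconstruction from the hyperparameters above, not the actual W-61 training script: the per-device batch size split, the bf16 setting, and the hh-rlhf preprocessing into prompt/chosen/rejected fields are assumptions, and the run's custom q_t/s_star/eta parameters are not represented.

```python
# Illustrative reconstruction of the training setup; not the actual W-61 script.
# Assumes a recent TRL release (processing_class argument) and an hh-rlhf dataset
# preprocessed into "prompt", "chosen", and "rejected" columns (mapping omitted).
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base = "W-61/llama-3-8b-base-sft-hh-helpful-4xh200"  # SFT starting point named above
model = AutoModelForCausalLM.from_pretrained(base)
tokenizer = AutoTokenizer.from_pretrained(base)

train_dataset = load_dataset("Anthropic/hh-rlhf", split="train")
# ... map hh-rlhf transcripts into prompt/chosen/rejected pairs here ...

config = DPOConfig(
    output_dir="llama-3-8b-dpo-hh-helpful",
    num_train_epochs=1,              # single epoch, as stated in the card
    learning_rate=5e-7,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    per_device_train_batch_size=16,  # assumed split: 16 per device x 4 GPUs = 64 total
    optim="adamw_torch",
    bf16=True,                       # assumption; common for H200 training
)

trainer = DPOTrainer(
    model=model,                     # reference policy is derived from this model by default
    args=config,
    train_dataset=train_dataset,
    processing_class=tokenizer,
)
trainer.train()
```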
Intended Use Cases
Given its DPO fine-tuning on a helpfulness dataset, this model is primarily intended for applications requiring helpful and aligned conversational AI. It is suitable for tasks where generating informative, safe, and user-friendly responses is crucial.
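For completeness, a minimal inference sketch using the Hugging Face transformers library follows. Because the model descends from a base (non-instruct) checkpoint, the "Human/Assistant" prompt format below is an assumption based on the hh-rlhf training data rather than a documented chat template, and the sampling settings are illustrative.

```python
# Minimal inference sketch; the model id is taken from this card, and the prompt
# format is assumed to follow the hh-rlhf "Human/Assistant" convention.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "W-61/llama-3-8b-base-new-dpo-hh-helpful-4xh200-batch-64-q_t-0.45-s_star-0.4-eta-0.05"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # assumption; halves memory vs. fp32 on supported GPUs
    device_map="auto",
)

prompt = "\n\nHuman: How do I brew a good cup of coffee?\n\nAssistant:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
# Decode only the newly generated tokens, skipping the echoed prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```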