W-61/llama-3-8b-base-margin-dpo-hh-helpful-8xh200
W-61/llama-3-8b-base-margin-dpo-hh-helpful-8xh200 is an 8-billion-parameter language model fine-tuned from W-61/llama-3-8b-base-sft-hh-helpful-8xh200 on the Anthropic/hh-rlhf dataset. It uses Direct Preference Optimization (DPO) to improve helpfulness, reaching a final loss of 0.4588 and a mean DPO margin of 11.1187. The model is intended for applications that need helpful, preference-aligned text generation, and builds on the Llama 3 architecture with an 8192-token context length.
Overview
This model, W-61/llama-3-8b-base-margin-dpo-hh-helpful-8xh200, is an 8-billion-parameter language model based on the Llama 3 architecture. It was fine-tuned with Direct Preference Optimization (DPO) on the Anthropic/hh-rlhf dataset, starting from the W-61/llama-3-8b-base-sft-hh-helpful-8xh200 supervised fine-tuned (SFT) checkpoint.
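For reference, the standard DPO objective (Rafailov et al., 2023) that this kind of fine-tuning optimizes is sketched below; here the reference policy $\pi_{\mathrm{ref}}$ would be the SFT checkpoint and $(y_w, y_l)$ the chosen and rejected responses from hh-rlhf. The "margin" variant suggested by the model name presumably adds an offset inside the sigmoid, but that detail is not documented in this card and should be treated as an assumption.

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}}) \;=\; -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\!\left[\log\sigma\!\left(\beta\log\frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)} \;-\; \beta\log\frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}\right)\right]
$$

The reported mean DPO margin most likely corresponds to this bracketed reward difference averaged over the evaluation set, though the card does not spell out how the metric is computed.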
Key Capabilities
- Helpful Response Generation: Optimized through DPO on human feedback data to produce more helpful and aligned outputs.
- Llama 3 Architecture: Benefits from the foundational capabilities of the Llama 3 8B model.
- Context Length: Supports an 8192-token context window, allowing longer inputs and outputs to be handled in a single pass (a minimal inference sketch follows this list).
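A minimal inference sketch, assuming the weights load with the standard transformers AutoModelForCausalLM API and that the model follows the hh-rlhf "Human:/Assistant:" prompt convention (the prompt format is an assumption, not documented behaviour of this checkpoint):

```python
# Minimal inference sketch. The prompt format and sampling settings are
# assumptions; only the repository name comes from this card.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "W-61/llama-3-8b-base-margin-dpo-hh-helpful-8xh200"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # 8B weights fit on a single modern GPU in bf16
    device_map="auto",
)

# hh-rlhf-style prompt (assumption about the training format).
prompt = "\n\nHuman: How do I write a polite follow-up email?\n\nAssistant:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# The model supports an 8192-token context window, so long prompts are fine.
outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```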
Training Details
The model was trained for 1 epoch with a learning rate of 5e-07, a per-device batch size of 16 across 8 GPUs (effective batch size 128), and a cosine learning-rate schedule with a 0.1 warmup ratio. Final evaluation shows a loss of 0.4588 and a mean DPO margin of 11.1187, indicating a clear separation between chosen and rejected responses by the end of preference training.
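The sketch below reproduces those hyperparameters with TRL's DPOTrainer. It is illustrative only: the card does not state the TRL version, the DPO beta, the maximum sequence length used in training, the hh-rlhf subset, or how the "margin DPO" loss variant was implemented, so all of those are assumptions here.

```python
# Illustrative training sketch using trl's DPOTrainer with the hyperparameters
# reported above. beta, the dataset subset, and the "margin DPO" variant are
# NOT documented in this card and are assumptions.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base_id = "W-61/llama-3-8b-base-sft-hh-helpful-8xh200"
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id)

# Assumption: the helpful-base subset; the card only says "hh-helpful".
# NOTE: hh-rlhf stores full conversations in `chosen`/`rejected`; depending on
# the trl version you may need to split out the shared prompt first.
dataset = load_dataset("Anthropic/hh-rlhf", data_dir="helpful-base", split="train")

config = DPOConfig(
    output_dir="llama-3-8b-margin-dpo-hh-helpful",
    learning_rate=5e-7,
    per_device_train_batch_size=16,  # x 8 GPUs = effective batch size 128
    num_train_epochs=1,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    bf16=True,
    beta=0.1,  # assumption: beta is not reported in the card
)

trainer = DPOTrainer(
    model=model,
    args=config,
    train_dataset=dataset,
    processing_class=tokenizer,  # older trl versions take `tokenizer=` instead
)
trainer.train()
```

To match the reported setup, this would be launched across 8 GPUs (e.g. with `accelerate launch` or `torchrun`) so the effective batch size reaches 128.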
Good For
- Applications requiring models that generate helpful and user-aligned responses.
- Tasks where DPO-based fine-tuning is beneficial for improving model behavior based on human preferences.
- Research and development in preference-based model alignment and helpfulness.