OpenRLHF/Llama-3-8b-rlhf-100k Overview
This model is an 8-billion-parameter Llama 3 variant from OpenRLHF, fine-tuned with Reinforcement Learning from Human Feedback (RLHF). Training covered 100,000 samples, starting from a Llama-3-8b SFT base model and guided by a Llama-3-8b reward model. The goal of the RLHF fine-tuning was to make the model generate more aligned and contextually appropriate responses.
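As a rough illustration of how such a checkpoint is typically used, the sketch below loads it with Hugging Face transformers. The model ID comes from this card's title; the dtype and device placement are assumptions about typical hardware, not documented requirements.

```python
# Minimal sketch: loading the RLHF checkpoint with Hugging Face transformers.
# Assumes a recent GPU with enough memory for an 8B model in bfloat16,
# and that `accelerate` is installed for device_map="auto".
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "OpenRLHF/Llama-3-8b-rlhf-100k"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # assumption: bf16 weights; adjust to your hardware
    device_map="auto",
)
```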
Key Capabilities & Training Details
- Architecture: Llama 3, 8 billion parameters.
- Fine-tuning Method: Reinforcement Learning from Human Feedback (RLHF).
- Training Data: Leveraged `OpenLLMAI/Llama-3-8b-sft-mixture` as the base SFT model, `OpenLLMAI/Llama-3-8b-rm-mixture` as the reward model, and `OpenLLMAI/prompt-collection-v0.1` for prompts.
- Training Scale: Fine-tuned on 100,000 samples to optimize GPU resource usage.
- Context Length: Supports a maximum prompt length of 2048 tokens and a maximum response length of 2048 tokens (see the generation sketch after this list).
- Performance Improvement: Achieved a score of 20.5 on Chat-Arena-Hard, significantly outperforming its `llama-3-8b-sft` base model, which scored 5.6.
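To respect the prompt and response budgets listed above, a chat-style generation call might look like the sketch below. It continues from the loading sketch earlier, assumes the tokenizer ships a Llama 3 chat template, and uses illustrative sampling settings rather than documented defaults.

```python
# Continues from the loading sketch above.
# Caps the prompt at 2048 tokens and the response at 2048 new tokens,
# matching the limits listed in the training details.
messages = [{"role": "user", "content": "Explain RLHF in two sentences."}]

input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    truncation=True,
    max_length=2048,      # maximum prompt length used during RLHF training
).to(model.device)

output_ids = model.generate(
    input_ids,
    max_new_tokens=2048,  # maximum response length used during RLHF training
    do_sample=True,       # sampling settings below are illustrative, not documented
    temperature=0.7,
    top_p=0.9,
)
print(tokenizer.decode(output_ids[0, input_ids.shape[-1]:], skip_special_tokens=True))
```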
Good For
- Chatbot Development: Ideal for applications requiring improved conversational quality and alignment.
- Response Generation: Suitable for tasks where generating helpful and contextually relevant text is crucial.
- Further RLHF Experimentation: Can serve as a strong base for additional RLHF fine-tuning or research due to its optimized training parameters and demonstrated performance uplift.