OpenRLHF/Llama-3-8b-rlhf-100k

Overview

This model is an 8-billion-parameter Llama 3 variant developed by OpenRLHF and fine-tuned with Reinforcement Learning from Human Feedback (RLHF). Training used 100,000 samples, starting from a Llama-3-8b SFT base model and guided by a Llama-3-8b reward model. The goal of this RLHF fine-tuning was to make the model's responses better aligned and more contextually appropriate.
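
The snippet below is a minimal inference sketch, not an official usage example from the model card: it assumes the checkpoint loads through the standard Hugging Face transformers API and that the tokenizer ships the usual Llama 3 chat template; the generation settings are illustrative.

    # Minimal inference sketch (assumptions noted above).
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "OpenRLHF/Llama-3-8b-rlhf-100k"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, device_map="auto"
    )

    # Build a single-turn prompt with the tokenizer's chat template.
    messages = [{"role": "user", "content": "Explain RLHF in two sentences."}]
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)

    # Generate and decode only the newly produced tokens.
    output = model.generate(inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
    print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))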

Key Capabilities & Training Details

  • Architecture: Llama 3, 8 billion parameters.
  • Fine-tuning Method: Reinforcement Learning from Human Feedback (RLHF).
  • Training Setup: Used OpenLLMAI/Llama-3-8b-sft-mixture as the base SFT model, OpenLLMAI/Llama-3-8b-rm-mixture as the reward model, and OpenLLMAI/prompt-collection-v0.1 as the prompt dataset.
  • Training Scale: Fine-tuned on 100,000 samples to conserve GPU resources.
  • Context Length: Supports a maximum prompt length of 2048 tokens and a maximum response length of 2048 tokens (see the length-budgeting sketch after this list).
  • Performance Improvement: Scored 20.5 on Chat-Arena-Hard, significantly outperforming its Llama-3-8b-sft base model, which scored 5.6.
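
As a practical note on the context-length figures above, the sketch below (an assumption-based illustration, not part of the model card) keeps prompts and responses within the 2048/2048-token budget used during RLHF training; the underlying Llama 3 architecture is not hard-limited to these lengths.

    # Illustrative length budgeting matching the training-time limits listed above.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    MAX_PROMPT_TOKENS = 2048    # maximum prompt length used in RLHF training
    MAX_RESPONSE_TOKENS = 2048  # maximum response length used in RLHF training

    model_id = "OpenRLHF/Llama-3-8b-rlhf-100k"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

    prompt = "Summarize the key ideas behind reinforcement learning from human feedback."
    # Truncate the prompt to the training-time maximum before generating.
    inputs = tokenizer(
        prompt, return_tensors="pt", truncation=True, max_length=MAX_PROMPT_TOKENS
    ).to(model.device)
    output = model.generate(**inputs, max_new_tokens=MAX_RESPONSE_TOKENS)
    print(tokenizer.decode(
        output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True
    ))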

Good For

  • Chatbot Development: Ideal for applications requiring improved conversational quality and alignment (a minimal multi-turn loop is sketched after this list).
  • Response Generation: Suitable for tasks where generating helpful and contextually relevant text is crucial.
  • Further RLHF Experimentation: Can serve as a strong starting point for additional RLHF fine-tuning or research, given its documented training setup and demonstrated performance uplift over the SFT base.
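
For chatbot prototyping, a minimal multi-turn loop might look like the sketch below; it is a hypothetical example that simply accumulates conversation history and reapplies the chat template each turn, under the same transformers-loading assumptions as the inference sketch above.

    # Hypothetical multi-turn chat loop: accumulate history, reapply the chat template.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "OpenRLHF/Llama-3-8b-rlhf-100k"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, device_map="auto"
    )

    history = []
    while True:
        user_turn = input("user> ")
        if not user_turn:
            break
        history.append({"role": "user", "content": user_turn})
        inputs = tokenizer.apply_chat_template(
            history, add_generation_prompt=True, return_tensors="pt"
        ).to(model.device)
        output = model.generate(inputs, max_new_tokens=512, do_sample=True, temperature=0.7)
        reply = tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True)
        history.append({"role": "assistant", "content": reply})
        print("assistant>", reply)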