RLHFlow/pair-preference-model-LLaMA3-8B
The RLHFlow/pair-preference-model-LLaMA3-8B is an 8-billion-parameter pairwise preference model, fine-tuned from Meta-Llama-3-8B-Instruct by RLHFlow. It is trained to rank pairs of responses, evaluating conversational quality, safety, and reasoning. The model is designed for integration into RLHF workflows, where it identifies the preferred output among candidate language-model responses.
RLHFlow/pair-preference-model-LLaMA3-8B Overview
This model is an 8 billion parameter preference model, fine-tuned from meta-llama/Meta-Llama-3-8B-Instruct by RLHFlow. Its primary function is to evaluate and rank pairs of responses, making it a crucial component in Reinforcement Learning from Human Feedback (RLHF) pipelines.
Key Capabilities
- Response Ranking: Designed to compare two given responses (A and B) and determine which is preferred based on learned preferences.
- Performance Metrics: Achieves strong results on reward benchmarks, including Chat 98.6, Chat-Hard 65.8, Safety 89.6, and Reasoning 94.9.
- Multi-turn Conversation Support: Capable of handling preference ranking within multi-turn conversational contexts.
- Bias Mitigation: Implements response swapping during evaluation to mitigate positional bias in ranking.
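The response-swapping idea above can be sketched in a few lines: score the pair in both presentation orders and average, so any systematic preference for the first (or second) slot cancels out. This is a minimal illustration, not the model's actual inference code; `score_pair` is a hypothetical stand-in for a call to the preference model that returns P(first response preferred).

```python
# Positional-bias mitigation via response swapping (illustrative sketch).
# `score_pair(first, second)` is a hypothetical scorer standing in for the
# preference model: it returns the probability that `first` is preferred
# when shown in the first slot. A real scorer would query the model twice.

def debiased_preference(score_pair, resp_a, resp_b):
    """Average P(resp_a preferred) over both presentation orders."""
    p_a_first = score_pair(resp_a, resp_b)         # resp_a shown first
    p_a_second = 1.0 - score_pair(resp_b, resp_a)  # resp_a shown second
    return (p_a_first + p_a_second) / 2.0

# Mock scorer with a deliberate +0.1 bias toward whichever response
# occupies the first slot, to show that the averaging cancels it:
def mock_scorer(first, second):
    base = 0.6 if len(first) > len(second) else 0.4
    return base + 0.1  # positional bias toward the first slot

print(debiased_preference(mock_scorer, "longer response", "short"))  # 0.6
```

With the biased mock scorer, the swapped evaluations yield 0.7 and 0.5, averaging back to the unbiased 0.6.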
Training and Methodology
The model was trained using the RLHFlow/pair_preference_model_dataset and leverages a training script from the RLHF-Reward-Modeling repository. The underlying methodology is detailed in the paper "RLHF Workflow: From Reward Modeling to Online RLHF" (TMLR, 2024), which describes the broader RLHF framework.
Use Cases
This model is ideal for:
- Automated Preference Labeling: Generating preference scores for LLM outputs to guide further fine-tuning.
- Response Quality Evaluation: Assessing the quality, safety, and reasoning capabilities of generated text.
- RLHF Integration: Serving as a reward model within complex RLHF systems to optimize LLM behavior.
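An automated preference-labeling step along these lines might look as follows. The prompt template and the convention of comparing the model's logits for the tokens "A" and "B" are assumptions based on the model's pairwise design; consult the RLHFlow model card for the exact format. Model inference is stubbed out here so the sketch stays self-contained.

```python
# Hedged sketch of automated preference labeling with a pairwise model.
# The [CONTEXT]/[RESPONSE A]/[RESPONSE B] template below is hypothetical;
# the real template is defined in the RLHFlow model card.

def build_pairwise_prompt(context, response_a, response_b):
    """Pair a shared context with two candidate responses."""
    return (f"[CONTEXT] {context} "
            f"[RESPONSE A] {response_a} "
            f"[RESPONSE B] {response_b}")

def label_preference(logit_a, logit_b):
    """Pick the preferred response from the model's logits for 'A'/'B'."""
    return "A" if logit_a > logit_b else "B"

prompt = build_pairwise_prompt("What is 2+2?", "4", "Probably 5.")
# In a real pipeline, `logit_a`/`logit_b` would come from a forward pass
# over `prompt`; here they are placeholder values for illustration.
print(label_preference(logit_a=3.2, logit_b=-1.1))  # "A"
```

The resulting "A"/"B" labels can then feed downstream fine-tuning (e.g. DPO-style pipelines) or be combined with the response-swapping trick described above.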