nlile/PE-7b-full
nlile/PE-7b-full is a 7-billion-parameter language model fine-tuned from Stability AI's StableBeluga-7B. On its evaluation set it reaches a rewards/accuracies score of 0.9888 and a rewards/margins value of 29.0043 between chosen and rejected responses. These reward metrics indicate that the model separates preferred from non-preferred responses well, which makes it suitable for applications where distinguishing between desired and undesired outputs is critical.
Model Overview
nlile/PE-7b-full is a 7-billion-parameter language model fine-tuned from stabilityai/StableBeluga-7B. It was trained for 3 epochs with a learning rate of 3e-07 and a total batch size of 64 across 8 GPUs. The final validation loss is 0.0066.
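The overview gives only the total batch size (64) and the GPU count (8); it does not say how each GPU's share was split between per-device batch size and gradient accumulation. As a rough sketch, assuming the common relation total = per-device batch size × gradient accumulation steps × number of GPUs, the combinations consistent with those two numbers can be listed. The function name `batch_splits` is made up for this illustration.

```python
def batch_splits(total_batch: int, num_gpus: int) -> list[tuple[int, int]]:
    """List (per_device_batch, grad_accum_steps) pairs that reproduce total_batch.

    Assumes total_batch = per_device_batch * grad_accum_steps * num_gpus.
    """
    per_gpu, remainder = divmod(total_batch, num_gpus)
    if remainder:
        raise ValueError("total batch size is not divisible by the GPU count")
    # Every divisor of the per-GPU share is a valid per-device batch size.
    return [(d, per_gpu // d) for d in range(1, per_gpu + 1) if per_gpu % d == 0]


# Values from the overview: total batch size 64 on 8 GPUs.
splits = batch_splits(64, 8)
for per_device, accum in splits:
    print(f"per-device batch {per_device} x gradient accumulation {accum}")
```

Which of these splits was actually used is not stated in the source.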
Key Performance Metrics
During its evaluation, PE-7b-full achieved notable results in reward modeling:
- Rewards/accuracies: 0.9888, meaning the chosen response received the higher reward in about 98.9% of evaluation pairs.
- Rewards/margins: 29.0043, the gap between the chosen and rejected reward scores (chosen minus rejected).
- Rewards/chosen: -0.4634, the reward reported for chosen responses.
- Rewards/rejected: -29.4677, the reward reported for rejected responses.
These metrics suggest the model has developed a robust ability to differentiate between high-quality and low-quality outputs based on its reward signal.
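The reported figures are internally consistent: the margin equals the chosen reward minus the rejected reward. The short sketch below checks that arithmetic and illustrates how a pairwise accuracy is conventionally computed. It assumes accuracy is the fraction of pairs in which the chosen response outscores the rejected one; the toy reward lists are invented for illustration and are not taken from the model's evaluation.

```python
# Reported evaluation values from the metrics list above.
rewards_chosen = -0.4634
rewards_rejected = -29.4677

# Margin = chosen reward minus rejected reward.
margin = rewards_chosen - rewards_rejected
print(f"margin = {margin:.4f}")


def pairwise_accuracy(chosen: list[float], rejected: list[float]) -> float:
    """Fraction of pairs where the chosen reward exceeds the rejected reward."""
    if len(chosen) != len(rejected) or not chosen:
        raise ValueError("need two non-empty lists of equal length")
    correct = sum(c > r for c, r in zip(chosen, rejected))
    return correct / len(chosen)


# Invented toy rewards (NOT from the real evaluation set).
toy_chosen = [-0.2, -0.9, -0.1, -1.5]
toy_rejected = [-30.1, -0.5, -28.7, -29.9]
toy_accuracy = pairwise_accuracy(toy_chosen, toy_rejected)
print(f"toy accuracy = {toy_accuracy}")
```

Note that both reported rewards are negative; what the accuracy and margin capture is their relative ordering and distance, not their absolute values.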
Training Configuration
The model was trained with the following optimizer settings, scheduler, and framework versions:
- Optimizer: Adam with betas=(0.9, 0.999) and epsilon=1e-08
- LR Scheduler: Linear, with a warmup ratio of 0.1
- Frameworks: Transformers 4.35.0, PyTorch 2.1.1+cu121, Datasets 2.14.6, Tokenizers 0.14.1
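To make the scheduler setting concrete, here is a minimal pure-Python sketch of a linear schedule with warmup: the learning rate rises linearly from 0 to the peak over the first 10% of steps, then decays linearly to 0. The peak value 3e-07 and the 0.1 warmup ratio come from this card; the total step count of 1000 is a placeholder, since the real number of training steps is not stated, and `linear_warmup_lr` is an illustrative helper rather than the training code itself.

```python
def linear_warmup_lr(step: int, total_steps: int,
                     base_lr: float = 3e-7, warmup_ratio: float = 0.1) -> float:
    """Learning rate at `step` for a linear warmup followed by linear decay."""
    warmup_steps = round(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Warmup phase: scale linearly from 0 up to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Decay phase: scale linearly from base_lr down to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Hypothetical run length; the actual number of steps is not given in the card.
TOTAL_STEPS = 1000
for s in (0, 50, 100, 550, 1000):
    print(f"step {s:4d}: lr = {linear_warmup_lr(s, TOTAL_STEPS):.3e}")
```

With these placeholder settings the peak learning rate is reached at step 100 and the rate returns to 0 at the final step.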
Potential Use Cases
Given its strong reward modeling performance, nlile/PE-7b-full could be particularly effective in applications such as:
- Reinforcement Learning from Human Feedback (RLHF): As a reward model to guide the training of other generative models.
- Content Moderation: Identifying and filtering undesirable content based on learned preferences.
- Response Ranking: Scoring and ranking generated text responses for quality or relevance.
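The response-ranking use case can be sketched independently of how scores are produced. The example below sorts candidate responses by a score function, highest first. How to obtain a score from nlile/PE-7b-full is not described in this card, so `score_fn` is a hypothetical hook, and `toy_score` is a deliberately trivial stand-in (word overlap with the prompt) used only to make the sketch runnable.

```python
from typing import Callable, List, Tuple

ScoreFn = Callable[[str, str], float]


def rank_responses(prompt: str, responses: List[str],
                   score_fn: ScoreFn) -> List[Tuple[float, str]]:
    """Return (score, response) pairs sorted from highest to lowest score."""
    scored = [(score_fn(prompt, r), r) for r in responses]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


def toy_score(prompt: str, response: str) -> float:
    """Stand-in scorer: number of prompt words that also occur in the response.

    A real setup would replace this with a score derived from the model.
    """
    prompt_words = set(prompt.lower().split())
    response_words = set(response.lower().split())
    return float(len(prompt_words & response_words))


prompt = "capital of france"
candidates = [
    "i do not know",
    "france has a capital",
    "paris is the capital of france",
]
ranked = rank_responses(prompt, candidates, toy_score)
for score, text in ranked:
    print(f"{score:.1f}  {text}")
```

Keeping the scorer behind a plain function argument means the same ranking logic applies whether scores come from this model, another reward model, or a heuristic.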