Model Overview
Starling-LM-7B-beta-laser-dpo is a 7-billion-parameter language model developed by the Nexusflow Team, fine-tuned from Openchat-3.5-0106 (itself based on Mistral-7B-v0.1). The model is trained with Reinforcement Learning from AI Feedback (RLAIF), using the Nexusflow/Starling-RM-34B reward model and a PPO-based policy optimization method.
Key Differentiators
- Catastrophic Forgetting Prevention: Employs a novel laserRMT-inspired training technique that partially freezes the model's weights. This is designed to keep the model from forgetting previously acquired knowledge, which is crucial when teaching specific skills such as function calling.
- RLAIF Training: Trained using RLAIF with an upgraded reward model and policy tuning pipeline, leveraging the berkeley-nest/Nectar ranking dataset.
- Performance: Achieves an MT Bench score of 8.12, as evaluated by GPT-4.
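The partial-freeze idea behind the laserRMT-inspired technique can be sketched as follows. This is an illustrative sketch only, not the Nexusflow training code: the parameter-name pattern and the choice of which layer indices stay trainable are assumptions for the example.

```python
import re

def select_trainable(param_names, trainable_layers):
    """Partition parameter names into frozen and trainable sets.

    Illustrative sketch of a partial freeze: only the layer indices
    listed in `trainable_layers` remain trainable; everything else is
    frozen to preserve previously learned behavior.
    """
    frozen, trainable = [], []
    for name in param_names:
        m = re.search(r"layers\.(\d+)\.", name)
        if m and int(m.group(1)) in trainable_layers:
            trainable.append(name)
        else:
            frozen.append(name)
    return frozen, trainable

# Hypothetical parameter names in a Mistral-style checkpoint.
names = [
    "model.layers.0.self_attn.q_proj.weight",
    "model.layers.1.mlp.down_proj.weight",
    "model.layers.2.self_attn.q_proj.weight",
    "lm_head.weight",
]
frozen, trainable = select_trainable(names, trainable_layers={1})
```

In an actual fine-tuning run, each name in `frozen` would have its parameter's `requires_grad` set to `False` before training, so the optimizer only updates the selected layers.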
Usage Considerations
- Chat Template: Requires exactly the same chat template as Openchat-3.5-0106 for optimal performance. This includes specific formatting for single-turn, multi-turn, and coding conversations.
- Verbosity: Model output can be verbose in rare cases; setting temperature = 0 is suggested to mitigate this.
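A minimal sketch of the OpenChat-style prompt formatting described above. The `GPT4 Correct User`/`GPT4 Correct Assistant` and `<|end_of_turn|>` tokens follow the published OpenChat 3.5 format, but in practice the template should be taken from the Openchat-3.5-0106 tokenizer itself (e.g. via `tokenizer.apply_chat_template`) rather than hand-built.

```python
def build_prompt(turns, mode="GPT4 Correct"):
    """Build an Openchat-3.5-0106-style prompt string.

    `turns` is a list of (role, text) pairs with role in
    {"user", "assistant"}. OpenChat also documents a "Code" mode
    for coding conversations, selectable via `mode`.
    """
    parts = []
    for role, text in turns:
        speaker = f"{mode} {'User' if role == 'user' else 'Assistant'}"
        parts.append(f"{speaker}: {text}<|end_of_turn|>")
    # Trailing assistant tag cues the model to respond.
    parts.append(f"{mode} Assistant:")
    return "".join(parts)

prompt = build_prompt([("user", "Hello")])
# → "GPT4 Correct User: Hello<|end_of_turn|>GPT4 Correct Assistant:"
```

Note that "temperature = 0" corresponds to greedy decoding; in the Hugging Face transformers generation API the equivalent is `do_sample=False`.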
Good For
- Applications requiring a 7B parameter model with enhanced helpfulness and harmlessness.
- Scenarios where preventing catastrophic forgetting of specific learned skills (e.g., function calling) is critical.
- Developers familiar with the Openchat-3.5-0106 chat template and usage patterns.