Storm-7B: GPT-4 Level Performance in a 7B Model
Storm-7B is an open-source 7 billion parameter language model developed by Jie Liu and collaborators, fine-tuned from openchat-3.5-0106. It introduces iterative length-regularized Direct Preference Optimization (iLR-DPO), a novel training approach that addresses the common pitfall of increased verbosity in iterative DPO methods. By penalizing response length during training, iLR-DPO enhances response quality and alignment with human values without making the model more verbose.
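The core idea, subtracting a length term from the DPO preference margin, can be sketched as below. This is a minimal single-pair illustration, not the authors' implementation: `alpha` is a hypothetical penalty weight, and the exact form of the regularizer in iLR-DPO may differ.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lr_dpo_loss(logp_chosen, logp_rejected,
                ref_logp_chosen, ref_logp_rejected,
                len_chosen, len_rejected,
                beta=0.1, alpha=0.01):
    """Length-regularized DPO loss for one preference pair (a sketch).

    The standard DPO margin (implicit reward of the chosen response minus
    that of the rejected one) is reduced by a term proportional to how much
    longer the chosen response is, so the model cannot win preferences
    simply by being more verbose. alpha is an assumed hyperparameter name.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    length_penalty = alpha * (len_chosen - len_rejected)
    return -math.log(sigmoid(margin - length_penalty))
```

With equal log-probability margins, a longer chosen response yields a larger loss, which is exactly the pressure against verbosity that iLR-DPO introduces.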
Key Capabilities & Performance
- GPT-4 Level Performance: Achieves a 50.5% length-controlled win rate against GPT-4 Preview on the AlpacaEval 2.0 leaderboard, making it the first open-source model to surpass GPT-4 Preview in this metric.
- Verbosity Control: iLR-DPO ensures that improvements in response quality do not lead to increased response length, maintaining conciseness.
- Enhanced Decoding: With beam search, the model shows a 5% improvement over regular decoding. Best-of-n sampling with the Starling-RM-34B reward model achieves a 61.6% LC win rate, outperforming GPT-4 Omni.
- Maintained NLP Performance: The model shows no significant degradation on traditional NLP tasks, as indicated by the Huggingface Open LLM Leaderboard.
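The best-of-n decoding strategy mentioned above is straightforward to sketch: sample several candidate responses and keep the one a reward model scores highest. `generate` and `score` are placeholders standing in for the policy model and a reward model such as Starling-RM-34B; the names are illustrative, not part of any real API.

```python
def best_of_n(prompt, generate, score, n=8):
    """Best-of-n sampling (sketch): draw n candidates from the policy
    and return the one the reward model ranks highest.

    generate(prompt) -> str      : one sampled response (placeholder)
    score(prompt, response) -> float : reward-model score (placeholder)
    """
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda response: score(prompt, response))
```

The quality gain comes purely from the reward model's ranking at inference time; no retraining is involved, at the cost of n times the generation compute.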
Use Cases & Limitations
Storm-7B is well-suited for conversational AI applications requiring high-quality, concise responses. It uses the same chat template as openchat-3.5-0106.
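As a rough illustration, the OpenChat-3.5 family formats conversations with role prefixes and `<|end_of_turn|>` separators, along the lines of the sketch below. Before use, verify the exact template against the model tokenizer (e.g. via `apply_chat_template` in Hugging Face Transformers), since this hand-rolled formatter is an approximation.

```python
def format_openchat(messages):
    """Render a chat in the OpenChat-3.5-0106 style (approximate sketch).

    messages is a list of {"role": "user"|"assistant", "content": str}.
    Each turn gets a role prefix and an <|end_of_turn|> separator, and the
    prompt ends with the assistant prefix so the model continues from there.
    """
    parts = []
    for message in messages:
        role = ("GPT4 Correct User" if message["role"] == "user"
                else "GPT4 Correct Assistant")
        parts.append(f"{role}: {message['content']}<|end_of_turn|>")
    parts.append("GPT4 Correct Assistant:")  # generation prompt
    return "".join(parts)
```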
Limitations include reliance on GPT-4 as a proxy for human judgment when evaluating alignment, and the use of a simple length penalty rather than a reward model that directly scores verbosity. Future work may explore training such a verbosity-specific reward model.