jieliu/Storm-7B
Storm-7B is a 7 billion parameter language model developed by Jie Liu, Zhanhui Zhou, Jiaheng Liu, Xingyuan Bu, Chao Yang, Han-Sen Zhong, and Wanli Ouyang, fine-tuned from openchat-3.5-0106. It utilizes iterative length-regularized Direct Preference Optimization (iLR-DPO) to achieve performance comparable to GPT-4 on AlpacaEval 2.0, specifically optimized to improve response quality without increasing verbosity. This model is designed for high-quality, concise conversational AI applications, demonstrating a 50.5% length-controlled win rate against GPT-4 Preview.
Storm-7B: GPT-4 Level Performance in a 7B Model
Storm-7B is an open-source 7 billion parameter language model developed by Jie Liu and collaborators, fine-tuned from openchat-3.5-0106. It introduces iterative length-regularized Direct Preference Optimization (iLR-DPO), a novel training approach that addresses the common pitfall of increased verbosity in iterative DPO methods. By penalizing response length during training, iLR-DPO enhances response quality and alignment with human values without making the model more verbose.
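To make the idea concrete, here is a minimal sketch of a length-regularized DPO loss on a single preference pair. This is an illustration of the general technique, not the paper's exact formulation; the hyperparameters `beta` and `alpha` and the stable log-sigmoid implementation are assumptions for the example.

```python
import math

def ilr_dpo_loss(logratio_chosen, logratio_rejected,
                 len_chosen, len_rejected,
                 beta=0.1, alpha=0.01):
    """Length-regularized DPO loss for one preference pair (illustrative).

    logratio_*: log pi_theta(y|x) - log pi_ref(y|x) for each response.
    len_*:      response lengths in tokens.
    beta, alpha: example hyperparameters, not values from the paper.
    """
    margin = beta * (logratio_chosen - logratio_rejected)
    # Length penalty: discourage preferring the chosen response merely
    # because it is longer than the rejected one.
    z = margin - alpha * (len_chosen - len_rejected)
    # Numerically stable -log(sigmoid(z)).
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))
```

With the same policy log-ratios, a longer chosen response incurs a higher loss than an equally preferred shorter one, which is exactly the pressure that keeps quality gains from turning into verbosity.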
Key Capabilities & Performance
- GPT-4 Level Performance: Achieves a 50.5% length-controlled win rate against GPT-4 Preview on the AlpacaEval 2.0 leaderboard, making it the first open-source model to surpass GPT-4 Preview in this metric.
- Verbosity Control: iLR-DPO ensures that improvements in response quality do not lead to increased response length, maintaining conciseness.
- Enhanced Decoding: Beam search yields roughly a 5% improvement over regular decoding, and best-of-n sampling scored with the Starling-RM-34B reward model reaches a 61.6% length-controlled win rate, outperforming GPT-4 Omni.
- Maintained NLP Performance: The model shows no significant degradation on traditional NLP tasks, as indicated by the Hugging Face Open LLM Leaderboard.
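Best-of-n sampling, as used with the Starling-RM-34B reward model above, can be sketched in a few lines. The `generate` and `reward` callables here are hypothetical stand-ins for the model and the reward scorer, not actual APIs:

```python
def best_of_n(prompt, generate, reward, n=8):
    """Best-of-n sampling: draw n candidates, keep the highest-reward one.

    generate(prompt) -> str          : one sampled completion (stand-in).
    reward(prompt, response) -> float: scalar quality score (stand-in).
    """
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda resp: reward(prompt, resp))
```

The trade-off is straightforward: inference cost scales linearly with `n`, while quality improves as long as the reward model can reliably rank the candidates.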
Use Cases & Limitations
Storm-7B is well-suited for conversational AI applications requiring high-quality, concise responses. It uses the same chat template as openchat-3.5-0106.
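For reference, the commonly documented OpenChat-3.5 template can be rendered as below. The string format is the widely published one for openchat-3.5-0106, but the authoritative version is the `chat_template` shipped with the model's tokenizer (e.g. via `tokenizer.apply_chat_template` in Transformers), so treat this as a sketch:

```python
def format_openchat(messages):
    """Render a message list in the OpenChat-3.5 style (illustrative).

    Prefer the tokenizer's own chat_template in practice; this mirrors
    the commonly documented "GPT4 Correct" format.
    """
    out = []
    for m in messages:
        role = ("GPT4 Correct User" if m["role"] == "user"
                else "GPT4 Correct Assistant")
        out.append(f"{role}: {m['content']}<|end_of_turn|>")
    out.append("GPT4 Correct Assistant:")  # generation prompt
    return "".join(out)
```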
Limitations include reliance on GPT-4 as a proxy for human judgment when evaluating alignment, and the use of a simple length penalty rather than a dedicated reward model for verbosity. Future work may explore training such a verbosity-specific reward model.