Snorkel-Mistral-PairRM-DPO: Iterative Alignment for Instruction Following
Snorkel-Mistral-PairRM-DPO is a 7-billion-parameter language model developed by Snorkel AI, built on the Mistral-7B-Instruct-v0.2 base. The model distinguishes itself through an iterative alignment methodology combining Direct Preference Optimization (DPO) with the PairRM reward model: generate multiple candidate responses per prompt, rerank them with PairRM, and apply DPO to the resulting chosen/rejected pairs, repeated over three iterations.
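The generate-rerank-update loop can be sketched as follows. This is a minimal toy illustration, not Snorkel's actual pipeline: `generate_candidates`, `pairrm_score`, and `dpo_update` are hypothetical stand-ins for the model being aligned, the PairRM reward model, and a real DPO training step.

```python
import random

def generate_candidates(model_state, prompt, n=5):
    # Stand-in for sampling n responses from the current policy.
    return [f"{prompt}-response-{i}" for i in range(n)]

def pairrm_score(prompt, response):
    # Stand-in scalar score; the real PairRM compares responses pairwise.
    return random.random()

def build_dpo_pairs(prompts, model_state):
    # For each prompt: sample candidates, rerank, and keep the top- and
    # bottom-ranked responses as the (chosen, rejected) preference pair.
    pairs = []
    for prompt in prompts:
        candidates = generate_candidates(model_state, prompt)
        ranked = sorted(candidates,
                        key=lambda r: pairrm_score(prompt, r),
                        reverse=True)
        pairs.append({"prompt": prompt,
                      "chosen": ranked[0],
                      "rejected": ranked[-1]})
    return pairs

def dpo_update(model_state, pairs):
    # Placeholder for an actual DPO optimization step on the pairs.
    return model_state + 1

model_state = 0
for iteration in range(3):  # three iterations, per the model card
    pairs = build_dpo_pairs(["prompt-a", "prompt-b"], model_state)
    model_state = dpo_update(model_state, pairs)
```

The key design point this sketch captures is that every chosen and rejected response is sampled from the model itself, so no external LLM outputs enter the preference data.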
Key Capabilities & Methodology
- Iterative DPO Alignment: Utilizes a novel iterative DPO process, starting with Mistral-7B-Instruct-v0.2, to progressively align the model.
- PairRM Integration: Employs the performant PairRM reward model to rerank candidate responses, so the training dataset contains no responses from external LLMs.
- Strong Instruction Following: Achieved a score of 30.22 on the Alpaca-Eval 2.0 benchmark, ranking 3rd overall and highest among open-source base models at the time of publication, a significant improvement over the base model's score of 14.72.
- Chat Optimization: Specifically optimized for chat-based interactions and general instruction following.
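For context on the optimization step itself, DPO trains directly on (chosen, rejected) pairs with a simple closed-form loss. The sketch below implements the standard per-pair DPO loss; the beta value and example log-probabilities are illustrative, not taken from Snorkel's training setup.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) pair.

    logp_* are summed log-probabilities of each response under the policy
    being trained; ref_logp_* are the same quantities under the frozen
    reference model (here, the model at the start of the iteration).
    beta scales the implicit KL penalty.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)): small when the policy prefers the chosen
    # response more strongly than the reference model does.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Positive margin (policy favors the chosen response relative to the
# reference) drives the loss below log(2), its value at zero margin.
print(dpo_loss(-10.0, -12.0, -11.0, -11.0))
```

Because the loss needs only log-probabilities from the policy and a frozen reference model, no separate reward model is required during optimization; PairRM's role is confined to constructing the pairs beforehand.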
Good For
- General-purpose chat applications requiring robust instruction adherence.
- Developers interested in DPO and reward model-based alignment techniques for LLMs.
- Benchmarking and research into iterative alignment strategies.
This model serves as a demonstration of programmatically aligning LLMs using smaller, specialized reward models, highlighting a scalable approach to improving instruction following without extensive manual annotation.