yapeichang/Qwen2.5-7B-BLEUBERI
yapeichang/Qwen2.5-7B-BLEUBERI is a 7.6-billion-parameter language model based on the Qwen2.5 architecture, developed by Yapei Chang and collaborators. It uses BLEU, a simple n-gram matching metric, directly as the reward in GRPO training for instruction following. The model matches the performance of reward-model-guided GRPO across general instruction-following benchmarks and produces more factually grounded outputs.
Model Overview
yapeichang/Qwen2.5-7B-BLEUBERI is a 7.6-billion-parameter language model built on the Qwen2.5 architecture. Developed by Yapei Chang and a team of researchers, the model introduces a novel approach to instruction following: it uses the BLEU metric directly as the reward signal within the GRPO (Group Relative Policy Optimization) training framework.
Key Capabilities
- Instruction Following: Excels in general instruction-following tasks, demonstrating performance comparable to systems trained with more complex 8B and 27B reward models.
- Factual Grounding: Produces outputs that are noted for being more factually grounded, as rated by human evaluators.
- Efficient Reward Mechanism: Utilizes BLEU, a straightforward n-gram matching metric, which is shown to achieve human agreement levels similar to larger reward models when paired with high-quality references from strong LLMs.
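To make the reward mechanism above concrete, here is a minimal, self-contained sketch of a sentence-level BLEU score (geometric mean of n-gram precisions with a brevity penalty) that could serve as a scalar reward. This is an illustrative simplification, not the authors' implementation: it assumes whitespace tokenization and a single reference, whereas production setups typically use a library such as SacreBLEU and may use multiple references.

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu_reward(candidate, reference, max_n=4):
    """Simplified sentence-level BLEU against one reference (illustrative).

    Returns a reward in [0, 1]: the geometric mean of clipped n-gram
    precisions (n = 1..max_n), scaled by a brevity penalty.
    """
    cand, ref = candidate.split(), reference.split()
    log_prec_sum = 0.0
    for n in range(1, max_n + 1):
        cand_counts = ngram_counts(cand, n)
        ref_counts = ngram_counts(ref, n)
        # Clipped overlap: candidate n-grams credited at most as often
        # as they appear in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        if overlap == 0:
            return 0.0  # any zero precision zeroes the geometric mean
        log_prec_sum += math.log(overlap / total) / max_n
    # Brevity penalty discourages overly short candidates.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(log_prec_sum)
```

For example, an exact match yields a reward of 1.0, a candidate sharing no n-grams with the reference yields 0.0, and partial matches fall in between.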
Training Methodology
The core innovation of BLEUBERI is extending RLVR (Reinforcement Learning from Verifiable Rewards) to open-ended instruction following. The research found that BLEU, despite its simplicity, is surprisingly effective as a reward signal, and applying it directly in GRPO training matched the performance of RM-guided GRPO across four instruction-following benchmarks.
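In GRPO, several completions are sampled per prompt, each is scored (here, by BLEU against a reference), and each completion's advantage is its reward normalized relative to the group. The following is a hedged sketch of that group-relative normalization step only, under the assumption of standard zero-mean, unit-variance scaling; the policy-gradient update and KL regularization of full GRPO are omitted.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Turn per-completion rewards for one prompt into GRPO-style
    group-relative advantages: (r - group mean) / (group std + eps).

    `rewards` would be, e.g., BLEU scores of sampled completions
    against a reference answer (illustrative sketch).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the sampled group
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Completions scoring above the group average receive positive advantages and are reinforced; below-average completions are penalized, so no separate learned value model or reward model is needed.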
Good For
- Applications requiring robust and factually grounded responses to general instructions.
- Scenarios where an efficient and effective reward mechanism for instruction following is desired, potentially reducing the computational overhead associated with larger reward models.