yapeichang/Qwen2.5-7B-BLEUBERI

Text generation · Model size: 7.6B · Quantization: FP8 · Context length: 32k · Published: May 27, 2025 · License: apache-2.0 · Architecture: Transformer · Open weights

yapeichang/Qwen2.5-7B-BLEUBERI is a 7.6 billion parameter language model based on the Qwen2.5 architecture, developed by Yapei Chang and collaborators. It utilizes BLEU, a simple n-gram matching metric, directly as a reward in GRPO training for instruction following. This model matches the performance of reward model-guided GRPO across general instruction-following benchmarks and produces factually grounded outputs.


Model Overview

yapeichang/Qwen2.5-7B-BLEUBERI is a 7.6 billion parameter language model built upon the Qwen2.5 architecture. Developed by Yapei Chang and a team of researchers, this model introduces a novel approach to instruction following by using the BLEU metric directly as the reward signal within the GRPO (Group Relative Policy Optimization) training framework.

Key Capabilities

  • Instruction Following: Excels in general instruction-following tasks, demonstrating performance comparable to systems trained with more complex 8B and 27B reward models.
  • Factual Grounding: Produces outputs that are noted for being more factually grounded, as rated by human evaluators.
  • Efficient Reward Mechanism: Utilizes BLEU, a straightforward n-gram matching metric, which is shown to achieve human agreement levels similar to larger reward models when paired with high-quality references from strong LLMs.

Training Methodology

The core innovation of BLEUBERI lies in extending RLVR (Reinforcement Learning from Verifiable Rewards) to open-ended instruction following. The research found that BLEU, despite its simplicity, is surprisingly effective as a reward signal. This insight led to its direct application in GRPO training, matching the performance of RM-guided GRPO across four distinct instruction-following benchmarks.
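To make the reward concrete, here is a minimal pure-Python sketch of the kind of n-gram reward BLEU provides: clipped n-gram precisions combined with a brevity penalty, scored against a single reference. This is a simplified, unsmoothed illustration, not the exact implementation used in BLEUBERI's training pipeline (which presumably relies on a standard BLEU library and the GRPO training stack).

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Return a Counter of all n-grams of order n in the token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu_reward(candidate, reference, max_n=4):
    """Simplified sentence-level BLEU: geometric mean of clipped
    n-gram precisions (n = 1..max_n) times a brevity penalty.

    In a GRPO-style setup, this score would be computed for each
    sampled rollout against a high-quality reference response.
    """
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        ref_counts = ngrams(ref, n)
        total = sum(cand_counts.values())
        if total == 0:
            return 0.0  # candidate shorter than n tokens
        # Clip each n-gram count by its count in the reference.
        clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if clipped == 0:
            return 0.0  # real BLEU implementations apply smoothing here
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty discourages degenerate short outputs.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(log_precisions) / max_n)
```

An identical candidate and reference score 1.0, disjoint texts score 0.0, and partial overlap falls in between, which is what makes the metric usable as a dense scalar reward for policy optimization.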

Good For

  • Applications requiring robust and factually grounded responses to general instructions.
  • Scenarios where an efficient and effective reward mechanism for instruction following is desired, potentially reducing the computational overhead associated with larger reward models.