yapeichang/Qwen2.5-7B-BLEUBERI

Text generation · Model size: 7.6B · Quantization: FP8 · Context length: 32k · Published: May 27, 2025 · License: apache-2.0 · Architecture: Transformer · Open weights

yapeichang/Qwen2.5-7B-BLEUBERI is a 7.6 billion parameter language model based on the Qwen2.5 architecture, developed by Yapei Chang and collaborators. It utilizes BLEU, a simple n-gram matching metric, directly as a reward in GRPO training for instruction following. This model matches the performance of reward model-guided GRPO across general instruction-following benchmarks and produces factually grounded outputs.


Model Overview

yapeichang/Qwen2.5-7B-BLEUBERI is a 7.6 billion parameter language model built upon the Qwen2.5 architecture. Developed by Yapei Chang and a team of researchers, this model introduces a novel approach to instruction following by using the BLEU metric directly as the reward signal within the GRPO (Group Relative Policy Optimization) training framework.

Key Capabilities

  • Instruction Following: Excels in general instruction-following tasks, demonstrating performance comparable to systems trained with more complex 8B and 27B reward models.
  • Factual Grounding: Produces outputs that are noted for being more factually grounded, as rated by human evaluators.
  • Efficient Reward Mechanism: Utilizes BLEU, a straightforward n-gram matching metric, which is shown to achieve human agreement levels similar to larger reward models when paired with high-quality references from strong LLMs.

Training Methodology

The core innovation of BLEUBERI lies in extending RLVR (Reinforcement Learning from Verifiable Rewards) to open-ended instruction following. The research found that BLEU, despite its simplicity, is surprisingly effective as a reward signal. This insight led to its direct application in GRPO training, matching the performance of RM-guided GRPO across four distinct instruction-following benchmarks.
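To make the reward concrete, here is a minimal pure-Python sketch of the kind of n-gram reward BLEU provides: clipped n-gram precisions combined with a brevity penalty, scored against a single reference. This is a simplified, unsmoothed illustration, not the exact implementation used in BLEUBERI's training pipeline (which presumably relies on a standard BLEU library and the GRPO training stack).

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Return a Counter of all n-grams of order n in the token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu_reward(candidate, reference, max_n=4):
    """Simplified sentence-level BLEU: geometric mean of clipped
    n-gram precisions (n = 1..max_n) times a brevity penalty.

    In a GRPO-style setup, this score would be computed for each
    sampled rollout against a high-quality reference response.
    """
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        ref_counts = ngrams(ref, n)
        total = sum(cand_counts.values())
        if total == 0:
            return 0.0  # candidate shorter than n tokens
        # Clip each n-gram count by its count in the reference.
        clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if clipped == 0:
            return 0.0  # real BLEU implementations apply smoothing here
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty discourages degenerate short outputs.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(log_precisions) / max_n)
```

An identical candidate and reference score 1.0, disjoint texts score 0.0, and partial overlap falls in between, which is what makes the metric usable as a dense scalar reward for policy optimization.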

Good For

  • Applications requiring robust and factually grounded responses to general instructions.
  • Scenarios where an efficient and effective reward mechanism for instruction following is desired, potentially reducing the computational overhead associated with larger reward models.