yapeichang/Qwen2.5-3B-RM8B
Text generation · Concurrency cost: 1 · Model size: 3.1B · Quantization: BF16 · Context length: 32k · Published: Jun 5, 2025 · License: apache-2.0 · Architecture: Transformer · Open weights · Cold

The yapeichang/Qwen2.5-3B-RM8B model is a 3.1 billion parameter language model based on the Qwen2.5 architecture, fine-tuned using GRPO training with a Skywork-RM-8B reward model. Developed by Yapei Chang and collaborators, this model is specifically optimized for general instruction following tasks. It demonstrates performance comparable to reward model-guided GRPO systems across various instruction-following benchmarks, producing factually grounded outputs.


Overview

yapeichang/Qwen2.5-3B-RM8B is a 3.1 billion parameter language model derived from the Qwen2.5 architecture. It was developed by Yapei Chang and a team of researchers focused on advancing instruction-following capabilities. The model comes out of work on BLEUBERI, an approach that uses BLEU (a simple n-gram matching metric) as a direct reward in GRPO (Group Relative Policy Optimization) training, rather than relying solely on traditional reward models.
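To make the reward signal concrete, here is a minimal, self-contained sketch of a smoothed sentence-level BLEU score used as a scalar reward. The function name, tokenization, and add-one smoothing are illustrative choices, not the authors' implementation:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu_reward(hypothesis, reference, max_n=4):
    """Smoothed BLEU of a generated completion against a reference,
    returned as a scalar in [0, 1] suitable as an RL reward."""
    hyp, ref = hypothesis.split(), reference.split()
    if not hyp:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = Counter(ngrams(hyp, n))
        ref_counts = Counter(ngrams(ref, n))
        overlap = sum((hyp_counts & ref_counts).values())  # clipped matches
        total = max(sum(hyp_counts.values()), 1)
        # add-one smoothing avoids log(0) when an n-gram order has no match
        log_precisions.append(math.log((overlap + 1) / (total + 1)))
    # brevity penalty discourages trivially short completions
    bp = min(1.0, math.exp(1 - len(ref) / len(hyp)))
    return bp * math.exp(sum(log_precisions) / max_n)
```

During training, each sampled completion would be scored this way against the reference answer, and the resulting scalar fed to GRPO in place of a learned reward model's output.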

Key Capabilities

  • General Instruction Following: The model is specifically trained to excel at understanding and executing a wide range of open-ended instructions.
  • Factually Grounded Outputs: It is noted for producing responses that are more factually accurate compared to some other systems.
  • Efficient Training Method: Employs BLEU as a surprisingly effective reward signal, achieving human agreement comparable to larger 8B and 27B reward models on Chatbot Arena outputs.
  • Performance Parity: Matches the performance of traditional reward model-guided GRPO across four distinct instruction-following benchmarks.

Good For

  • Instruction-Following Applications: Ideal for tasks requiring the model to accurately follow complex or open-ended instructions.
  • Applications Requiring Factual Accuracy: Suitable for use cases where generating factually correct information is critical.
  • Research into Reward Mechanisms: Demonstrates an alternative, potentially more efficient, method for training instruction-following models without solely relying on large, complex reward models. The underlying research is detailed in the paper: BLEUBERI: BLEU is a surprisingly effective reward for instruction following.