The yapeichang/Qwen2.5-3B-RM8B model is a 3.1-billion-parameter language model based on the Qwen2.5 architecture, fine-tuned with GRPO using the Skywork 8B reward model (RM-8B) as the reward signal. Developed by Yapei Chang and collaborators, it is optimized for general instruction following, performs strongly across several instruction-following benchmarks, and produces factually grounded outputs.
Overview
yapeichang/Qwen2.5-3B-RM8B is a 3.1-billion-parameter language model derived from the Qwen2.5 architecture, developed by Yapei Chang and a team of researchers focused on advancing instruction-following capabilities. It comes from the BLEUBERI project, which shows that BLEU, a simple n-gram matching metric, can serve as a direct reward in GRPO (Group Relative Policy Optimization) training instead of a traditional reward model; as its name suggests, this checkpoint is the comparison model trained with an 8B reward model rather than with BLEU.
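The core of GRPO is that it needs no learned value function: several completions are sampled per prompt, each is scored by the reward signal (a reward model here, or BLEU in BLEUBERI), and each completion's advantage is its reward normalized against the group. A minimal sketch of that group-relative normalization, with illustrative names (not the project's actual code):

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize one prompt's group of rewards into advantages:
    completions scoring above the group mean are reinforced,
    those below are discouraged."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: four sampled completions for one prompt, scored by the
# reward model (or, in BLEUBERI, by BLEU against reference answers).
rewards = [0.82, 0.41, 0.67, 0.41]
advantages = group_relative_advantages(rewards)
```

Because the advantages are centered within each group, they sum to roughly zero: the policy update pushes probability mass from below-average completions toward above-average ones.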
Key Capabilities
- General Instruction Following: The model is specifically trained to excel at understanding and executing a wide range of open-ended instructions.
- Factually Grounded Outputs: It is noted for producing responses that are more factually accurate compared to some other systems.
- Efficient Training Method: Builds on the finding that BLEU is a surprisingly effective reward signal, with agreement with human preferences on Chatbot Arena outputs comparable to that of much larger 8B and 27B reward models.
- Performance Parity: In the BLEUBERI evaluation, BLEU-trained models match reward model-guided GRPO across four distinct instruction-following benchmarks.
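To make the BLEU-as-reward idea concrete, here is a self-contained sketch of a sentence-level BLEU score used as a scalar reward. It assumes whitespace tokenization and add-one smoothing for brevity; the paper's actual setup presumably uses a standard BLEU implementation such as sacrebleu:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu_reward(candidate, reference, max_n=4):
    """Simplified sentence-level BLEU: geometric mean of clipped n-gram
    precisions (add-one smoothed) times a brevity penalty. Illustrative
    only, not the BLEUBERI implementation."""
    cand, ref = candidate.split(), reference.split()
    log_prec = 0.0
    for n in range(1, max_n + 1):
        c_ngrams, r_ngrams = ngrams(cand, n), ngrams(ref, n)
        overlap = sum((c_ngrams & r_ngrams).values())  # clipped matches
        total = sum(c_ngrams.values())
        # add-one smoothing keeps the reward defined for short outputs
        log_prec += math.log((overlap + 1) / (total + 1)) / max_n
    # brevity penalty discourages trivially short completions
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(log_prec)

ref = "the capital of france is paris"
r_good = bleu_reward("the capital of france is paris", ref)  # exact match
r_bad = bleu_reward("paris is nice", ref)                    # partial overlap
```

A completion that reproduces the reference gets reward 1.0, while shorter or less overlapping completions score lower, giving GRPO a cheap, reference-based preference signal with no reward model in the loop.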
Good For
- Instruction-Following Applications: Ideal for tasks requiring the model to accurately follow complex or open-ended instructions.
- Applications Requiring Factual Accuracy: Suitable for use cases where generating factually correct information is critical.
- Research into Reward Mechanisms: Demonstrates an alternative, potentially more efficient method for training instruction-following models without relying on large, complex reward models. The underlying research is detailed in the paper "BLEUBERI: BLEU is a surprisingly effective reward for instruction following".