openbmb/RLPR-Llama3.1-8B-Inst

Task: Text Generation | Concurrency Cost: 1 | Model Size: 8B | Quant: FP8 | Ctx Length: 32k | Published: Jun 22, 2025 | License: apache-2.0 | Architecture: Transformer | Open Weights

The openbmb/RLPR-Llama3.1-8B-Inst model is an 8 billion parameter instruction-tuned language model developed by OpenBMB, built upon Llama3.1-8B-Instruct. It is trained with the RLPR framework, which enhances reasoning without an external verifier by using the model's intrinsic generation probability as a direct reward signal. The model shows substantial improvements over baselines on general and mathematical reasoning tasks.


Model Overview

openbmb/RLPR-Llama3.1-8B-Inst is an 8 billion parameter instruction-tuned model developed by OpenBMB, built on Llama3.1-8B-Instruct. Its core innovation is the RLPR (Reinforcement Learning with Probability-based Reward) framework, which enhances reasoning capabilities without relying on external verifiers.

Key Capabilities & Innovations

  • Verifier-Free Reasoning Enhancement: RLPR uses the LLM's intrinsic generation probability as a direct reward signal, eliminating the need for external verifiers and specialized fine-tuning. This makes it broadly applicable to complex and diverse reasoning tasks.
  • Innovative Reward & Training Framework:
    • Employs a Probability-based Reward (PR) computed from the average decoding probability of the reference answer's tokens. This yields higher-quality, debiased reward signals and outperforms a plain sequence-likelihood reward.
    • Features a standard deviation filtering mechanism that dynamically filters prompts, stabilizing training and significantly boosting performance.
  • Strong Performance: Demonstrates substantial improvements in both general and mathematical reasoning benchmarks, surpassing the RLVR baseline by an average of 1.4 points across seven benchmarks.
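The reward and filtering ideas in the list above can be illustrated with a small sketch. This is not the RLPR training code: the function names, the fixed `min_std` threshold, and the example numbers are all hypothetical, the per-token probabilities are assumed to have already been read off the model for the reference answer, and the debiasing step is not shown. The framework's filter is described as dynamic, so a fixed threshold is a simplification.

```python
import math
from statistics import pstdev


def mean_token_probability(token_probs):
    """Probability-based reward sketch: average of the per-token
    probabilities the model assigns to the reference answer."""
    return sum(token_probs) / len(token_probs)


def sequence_likelihood(token_probs):
    """Simpler alternative for comparison: product of the per-token
    probabilities, i.e. the likelihood of the whole answer."""
    return math.prod(token_probs)


def keep_prompt(rewards, min_std):
    """Standard deviation filter sketch: keep a prompt only if the rewards
    of its sampled responses vary by at least min_std (hypothetical fixed
    threshold; the real mechanism adjusts dynamically)."""
    return pstdev(rewards) >= min_std


# Hypothetical per-token probabilities for a four-token reference answer,
# where one token is judged unlikely by the model.
probs = [0.9, 0.9, 0.01, 0.9]
print(mean_token_probability(probs))  # stays moderate despite one low token
print(sequence_likelihood(probs))     # collapses toward zero

# Hypothetical rewards for several sampled responses to two prompts.
print(keep_prompt([0.1, 0.9, 0.4, 0.7], min_std=0.05))  # varied rewards
print(keep_prompt([0.5, 0.5, 0.5, 0.5], min_std=0.05))  # identical rewards
```

The comparison shows why averaging may be preferred: a single low-probability token drags the product close to zero, while the mean degrades gradually. The filter drops prompts whose sampled responses all receive nearly the same reward, since such prompts give little signal for distinguishing better responses from worse ones.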

Training Details

When to Use This Model

This model is particularly well-suited for applications requiring enhanced reasoning capabilities, especially in scenarios where external verifiers are impractical or unavailable. Its verifier-free approach makes it a versatile choice for complex problem-solving and mathematical tasks.