Overview
pixas/Miner-8B is an 8-billion-parameter reasoning model developed by pixas and trained with the MINER (Mining Intrinsic Mastery for Data-Efficient RL in Large Reasoning Models) framework. MINER is a reinforcement learning method that targets a data inefficiency in critic-free RL: prompts on which every sampled rollout is correct provide little learning signal. It addresses this by using the policy's intrinsic uncertainty as a self-supervised reward, with no auxiliary reward model and no additional inference-time overhead.
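To see why all-correct prompts carry little signal, consider the group-normalized advantage commonly used by GRPO-style critic-free methods. The sketch below is a minimal illustration and not MINER's implementation; the function name `group_advantages` and the `eps` constant are assumptions made for the example.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantages: (r - mean) / (std + eps).

    Hypothetical helper illustrating GRPO-style normalization over the
    rollouts sampled for one prompt.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the rollout group
    return [(r - mu) / (sigma + eps) for r in rewards]

# Mixed outcomes: correct rollouts get positive advantage, wrong ones negative.
mixed = group_advantages([1.0, 0.0, 1.0, 0.0])

# All rollouts correct: every advantage is zero, so the policy gradient vanishes.
all_correct = group_advantages([1.0, 1.0, 1.0, 1.0])
```

With a binary verifiable reward alone, the second prompt contributes nothing to the update, which is the gap an intrinsic, uncertainty-based reward is meant to fill.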
Key Innovations
The MINER framework introduces two core concepts:
- Token-level focal credit assignment: This mechanism amplifies learning for uncertain and critical tokens while suppressing overconfident ones.
- Adaptive advantage calibration: This integrates intrinsic and verifiable rewards in a stable manner.
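As a rough intuition for token-level focal credit assignment, the sketch below borrows the focal-loss form `(1 - p) ** gamma` to weight tokens by how uncertain the policy was about them. This is an assumed stand-in chosen for illustration, not the weighting MINER actually uses; `focal_token_weights` and `gamma` are hypothetical names.

```python
import math

def focal_token_weights(token_logprobs, gamma=2.0):
    """Weight each token by (1 - p) ** gamma, where p = exp(logprob).

    Assumption: a focal-loss-style weight, used here only to illustrate
    "amplify uncertain tokens, suppress overconfident ones".
    """
    return [(1.0 - math.exp(lp)) ** gamma for lp in token_logprobs]

# Three tokens sampled with probability 0.99, 0.5, and 0.1 respectively.
logprobs = [math.log(0.99), math.log(0.5), math.log(0.1)]
weights = focal_token_weights(logprobs)
# The near-certain token receives a tiny weight; the uncertain one dominates.
```

Larger values of `gamma` suppress confident tokens more aggressively; `gamma = 0` recovers uniform weighting across tokens.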
Performance & Evaluation
Evaluated on six reasoning benchmarks, MINER demonstrates stronger sample efficiency and accuracy than several baseline methods, including GRPO variants. The model is a research checkpoint, and its performance may vary depending on the base model, data mixture, and evaluation pipeline used.
Intended Use Cases
This model is primarily intended for research and experimental use in:
- Reasoning and problem-solving tasks.
- Reinforcement learning for language models.
- Mathematical and verifiable reasoning.
- Post-training and evaluation of large reasoning models.
Potential applications include academic research, evaluation on reasoning benchmarks, and further finetuning based on the MINER framework. Users should be aware of its limitations: the model may produce incorrect or incomplete reasoning, and its performance is sensitive to prompt format and decoding setup.