pixas/Miner-4B
Miner-4B is a 4-billion-parameter reasoning model developed by pixas and trained with the MINER reinforcement learning (RL) method. MINER improves the data efficiency of RL for large reasoning models by using the model's intrinsic uncertainty as a self-supervised reward signal. The model is designed to improve performance on reasoning and problem-solving tasks, particularly in scenarios where standard RL methods are inefficient. It is intended for research and experimental use in areas such as mathematical reasoning and RL for language models.
Overview
Miner-4B is a 4-billion-parameter reasoning model developed by pixas and trained with the MINER (Mining Intrinsic Mastery for Data-Efficient RL in Large Reasoning Models) framework. MINER addresses the data inefficiency of critic-free reinforcement learning methods, which is most pronounced when every sampled rollout is correct and the rollouts therefore provide little learning signal. It introduces two components: token-level focal credit assignment, which amplifies learning on uncertain tokens, and adaptive advantage calibration, which stabilizes the integration of intrinsic and verifiable rewards.
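The weak learning signal from all-correct rollouts can be illustrated with a small sketch. It assumes a GRPO-style group-relative advantage, where each reward is standardized against the mean and standard deviation of its rollout group. The function name, the `eps` constant, and the 0/1 rewards are illustrative assumptions, not details taken from MINER.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize each reward within its rollout group (GRPO-style assumption).

    `eps` is a hypothetical stabilizer that avoids division by zero when
    every reward in the group is identical.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Mixed group: correct rollouts get positive advantages, incorrect ones negative.
mixed = group_relative_advantages([1.0, 0.0, 1.0, 0.0])

# All-correct group: every reward equals the group mean, so every advantage is 0.0
# and the verifiable reward alone contributes no gradient for this prompt.
all_correct = group_relative_advantages([1.0, 1.0, 1.0, 1.0])

print(mixed)
print(all_correct)
```

Under these assumptions, a prompt the model already solves on every rollout contributes nothing to the update, which is the case an intrinsic, uncertainty-based reward is meant to cover.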
Key Capabilities
- Enhanced Reasoning: Improves performance on complex reasoning and problem-solving tasks.
- Data-Efficient RL: Leverages intrinsic uncertainty for self-supervised reward signals, reducing reliance on auxiliary reward models.
- Robust Training: Evaluated on six reasoning benchmarks, where it shows better sample efficiency and higher accuracy than baseline GRPO variants.
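As a rough illustration of how uncertainty can steer credit toward specific tokens, the sketch below applies a focal-loss-style weight `(1 - p) ** gamma` to the probability the model assigned to each sampled token. This is an assumption made for illustration only: the weighting form, the `gamma` value, and the function name are hypothetical and are not MINER's actual formulation.

```python
def focal_token_weights(token_probs, gamma=2.0):
    """Return one weight per token; low-probability (uncertain) tokens weigh more.

    Hypothetical focal-style weighting: (1 - p) ** gamma, where p is the
    probability the model assigned to the token it actually sampled.
    """
    return [(1.0 - p) ** gamma for p in token_probs]


# Probabilities of three sampled tokens: confident, moderately sure, uncertain.
weights = focal_token_weights([0.99, 0.60, 0.10])
print(weights)
```

With this weighting, tokens the model already predicts confidently are nearly ignored, while uncertain tokens dominate the update. Larger `gamma` values sharpen that contrast.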
Intended Use Cases
- Research and Experimentation: Ideal for academic research on RL for reasoning models.
- Mathematical and Verifiable Reasoning: Suitable for tasks requiring precise logical deduction.
- Model Evaluation: Useful for evaluating reasoning benchmarks and conducting ablation studies based on the MINER framework.
- Further Finetuning: Serves as a checkpoint for additional finetuning or post-training efforts.