olaverse/MIST-Mini-8B-Thinking
MIST-Mini-8B-Thinking by olaverse is an 8-billion-parameter, reasoning-focused language model derived from the MIST-Mini-8B family. It is trained with four phases of Group Relative Policy Optimization (GRPO) reinforcement learning to show its step-by-step thought process inside <think> tags before giving an answer. The model targets mathematical reasoning, reaching 95% accuracy on GSM8K, and is designed for transparent, verifiable problem-solving on consumer GPUs.
MIST-Mini-8B-Thinking Overview
MIST-Mini-8B-Thinking, developed by olaverse, is an 8-billion-parameter language model specialized in transparent reasoning. It is a variant of the MIST-Mini-8B model, fine-tuned through a 4-phase Group Relative Policy Optimization (GRPO) reinforcement learning process. This training teaches the model to write out its thought process within <think> tags before delivering a final answer, so users can inspect and verify its reasoning.
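Because the reasoning and the final answer arrive in one string, downstream code usually needs to separate them. The following is a minimal sketch, assuming the output contains a single closed <think>...</think> block followed by the answer; the helper name `split_reasoning` and the sample text are illustrative and not part of the model's tooling.

```python
import re
from typing import Tuple

# Assumed output shape: one <think>...</think> block, then the final answer.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_reasoning(text: str) -> Tuple[str, str]:
    """Return (reasoning, answer); reasoning is "" if no <think> block is found."""
    match = THINK_RE.search(text)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    answer = text[match.end():].strip()
    return reasoning, answer

# Hypothetical model output, used only to exercise the helper.
sample = "<think>48 / 2 = 24. 48 + 24 = 72.</think>\nThe answer is 72."
reasoning, answer = split_reasoning(sample)
print("Reasoning:", reasoning)
print("Answer:", answer)
```

Keeping the two parts separate lets an application show the answer by default and expose the reasoning on demand.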
Key Capabilities and Features
- Transparent Reasoning: Explicitly shows step-by-step thinking using <think> tags, allowing users to follow and verify the model's logic.
- Strong Mathematical Performance: Achieves 95% accuracy on the GSM8K benchmark after its specialized training, indicating robust math problem-solving abilities.
- Efficient and Accessible: As an 8B parameter model, it is designed to run efficiently on consumer-grade GPUs, with 4-bit quantized versions fitting on as little as 6GB of VRAM.
- GRPO Training: Utilizes a sophisticated reinforcement learning approach with specific reward functions for correct answers, structured reasoning steps, and proper use of the <think> format.
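The three reward signals named in the GRPO bullet can be sketched roughly as below. This is an illustration under stated assumptions, not the training code used for this model: the scoring rules, the weights in `total_reward`, and the sample completions are all hypothetical. The only part taken from the GRPO method itself is the last function, which normalizes each reward against the mean and standard deviation of its sampling group.

```python
import re
from statistics import mean, pstdev
from typing import List

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def format_reward(completion: str) -> float:
    """1.0 if there is exactly one closed <think> block followed by some answer text."""
    if len(THINK_RE.findall(completion)) != 1:
        return 0.0
    after = completion.split("</think>", 1)[1].strip()
    return 1.0 if after else 0.0

def steps_reward(completion: str) -> float:
    """Hypothetical proxy for structured reasoning: 0.25 per non-empty line, capped at 1.0."""
    match = THINK_RE.search(completion)
    if match is None:
        return 0.0
    steps = [line for line in match.group(1).splitlines() if line.strip()]
    return min(0.25 * len(steps), 1.0)

def correctness_reward(completion: str, gold: str) -> float:
    """1.0 if the last number after </think> equals the reference answer."""
    answer_part = completion.split("</think>")[-1].replace(",", "")
    numbers = re.findall(r"-?\d+(?:\.\d+)?", answer_part)
    return 1.0 if numbers and float(numbers[-1]) == float(gold) else 0.0

def total_reward(completion: str, gold: str) -> float:
    # The weights are assumptions chosen for illustration only.
    return (2.0 * correctness_reward(completion, gold)
            + 0.5 * steps_reward(completion)
            + 0.5 * format_reward(completion))

def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each reward against its own group: (r - mean) / (std + eps)."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# A hypothetical group of completions sampled for one prompt with reference answer 72.
group = [
    "<think>\n48 / 2 = 24\n48 + 24 = 72\n</think>\nThe answer is 72.",
    "<think>\n48 + 2 = 50\n</think>\nThe answer is 50.",
    "The answer is 72.",
]
rewards = [total_reward(c, "72") for c in group]
advantages = group_relative_advantages(rewards)
for r, a in zip(rewards, advantages):
    print(f"reward={r:.3f} advantage={a:+.3f}")
```

Because advantages are computed relative to the group, a completion is reinforced only when it scores better than its siblings for the same prompt, which is why GRPO needs no separate value model.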
Ideal Use Cases
- Educational Tools: For applications requiring step-by-step explanations in math or logic.
- Problem Solving: Scenarios where not just the answer, but also the method to arrive at it, is crucial.
- Auditable AI: Use cases demanding transparency and verifiability of the model's decision-making process.
- Resource-Constrained Environments: Its efficient size allows deployment on consumer hardware.
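The consumer-hardware claim can be checked with back-of-envelope arithmetic. The sketch below assumes exactly 8 billion parameters and counts only the weights; it ignores quantization metadata, the KV cache, and activations, which take up part of the remaining headroom. The function name `weight_memory_gb` is illustrative.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

N_PARAMS = 8e9  # assumed round figure for an "8B" model

for bits in (16, 8, 4):
    print(f"{bits}-bit weights: {weight_memory_gb(N_PARAMS, bits):.1f} GB")
# 16-bit weights: 16.0 GB
# 8-bit weights: 8.0 GB
# 4-bit weights: 4.0 GB
```

At 4 bits per parameter the weights take about 4 GB, which leaves roughly 2 GB of a 6 GB card for everything else and is consistent with the stated minimum.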