Basher17/VibeThinker-3B-MLX
VibeThinker-3B-MLX is a 3.1 billion parameter model from Basher17, part of the VibeThinker series, specifically optimized for challenging reasoning tasks in mathematics, coding, and STEM. It achieves strong performance on verifiable reasoning benchmarks like AIME, HMMT, IMO-AnswerBench, and LiveCodeBench, reaching the performance range of significantly larger frontier models. This model is particularly suited for competitive programming problems and other tasks where answers can be clearly verified.
Loading preview...
Overview
VibeThinker-3B is a 3.1 billion parameter model developed by WeiboAI, focusing on advanced reasoning tasks in mathematics, coding, and STEM. It systematically optimizes the Spectrum-to-Signal Principle (SSP) post-training pipeline, enabling it to achieve performance comparable to much larger frontier reasoning models on verifiable benchmarks. The model demonstrates that compact models can achieve near-frontier reasoning capabilities in structured task spaces with reliable feedback signals.
Key Capabilities
- Exceptional Reasoning: Achieves 76.4 on IMO-AnswerBench (80.6 with CLR), a benchmark of 400 International Mathematical Olympiad-level problems, outperforming models like DeepSeek V3.2 (671B) and GLM-5 (744B) in relative accuracy to scale.
- Competitive Programming Prowess: Passed 123 out of 128 first-attempt submissions (96.1% acceptance rate) on recent unseen LeetCode weekly and biweekly contests (Python).
- Robust Training: Utilizes a multi-stage training pipeline including curriculum-based two-stage Supervised Fine-Tuning (SFT), Multi-domain Reasoning Reinforcement Learning (RL), Offline Self-Distillation, and Instruct RL to enhance reasoning and controllability.
Good For
- Competitive Programming: Excels at LeetCode-style problems and similar coding challenges.
- Hard Math & STEM Reasoning: Ideal for tasks requiring multi-step reasoning, constraint satisfaction, and answer verification in mathematics and science.
- Benchmark Evaluation: Recommended for evaluating against challenging datasets like AMOBench for harder math reasoning. Not recommended for tool-calling, API orchestration, or autonomous coding agents.