NotoriousH2/gemma-3-1b-it-Math-GRPO
NotoriousH2/gemma-3-1b-it-Math-GRPO is a 1-billion-parameter, Gemma-based, instruction-tuned language model optimized specifically for Korean mathematical reasoning. It was trained with a three-stage pipeline — SFT, RS-SFT, and GRPO — targeting improved performance on mathematical problem solving. The model achieves approximately 46.2% on the Korean GSM8K benchmark, demonstrating its specialized capability in mathematical tasks, and its 32,768-token context window supports complex problem understanding.
NotoriousH2/gemma-3-1b-it-Math-GRPO Overview
This model is a 1-billion-parameter, Gemma-based, instruction-tuned language model developed by NotoriousH2, specifically engineered for Korean mathematical reasoning. It leverages a three-stage training pipeline: Supervised Fine-Tuning (SFT), Rejection Sampling SFT (RS-SFT), and Group Relative Policy Optimization (GRPO).
Key Capabilities & Performance
- Specialized Math Reasoning: Optimized for solving mathematical problems in Korean.
- Benchmark Performance: Achieves approximately 46.2% on the Korean GSM8K evaluation (264 problems) and ~16.5% on the Korean MATH benchmark (577 problems).
- Advanced Training: Utilizes a GRPO stage, though the README notes that for this 1B model, GRPO did not provide significant improvement over the RS-SFT baseline, suggesting the model's capacity was already near optimal with SFT+RS-SFT.
- Context Length: Features a substantial 32,768-token context window, beneficial for handling longer mathematical problems and complex instructions.
Training Methodology Highlights
- SFT → RS-SFT → GRPO Pipeline: A multi-stage approach to enhance instruction following and reasoning.
- Data Strategy: GRPO stage uses only prompts from 6,871 unique Korean GSM8K training problems, with the model generating its own solutions for reward calculation.
- DPO Analysis: The developers conducted extensive analysis of DPO (Direct Preference Optimization) failures, concluding that the 1B model lacked the capacity to discern the subtle differences between correct and incorrect solutions that DPO needs in order to be effective.
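The GRPO stage described above samples several solutions per prompt, scores each with a verifiable correctness reward, and normalizes rewards within the group to obtain advantages. The sketch below illustrates that core computation; the exact-match reward and function names are assumptions for illustration, not the card's published training code.

```python
# Sketch of GRPO's group-relative advantage computation.
# The binary exact-match reward is an assumed stand-in for the
# actual reward used in training.
from statistics import mean, pstdev

def correctness_reward(generated_answer: str, gold_answer: str) -> float:
    """1.0 if the model's final answer matches the gold answer, else 0.0."""
    return 1.0 if generated_answer.strip() == gold_answer.strip() else 0.0

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO normalizes each sample's reward against its own group:
    advantage_i = (r_i - mean(group)) / std(group)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0:  # all samples equally good or bad -> no learning signal
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]

# Example: four sampled solutions to one Korean GSM8K prompt, gold answer "42"
samples = ["42", "41", "42", "7"]
rewards = [correctness_reward(s, "42") for s in samples]
advantages = group_relative_advantages(rewards)
```

Because the reward is group-relative, a prompt where every sample is wrong (or every sample is right) yields zero advantage everywhere — one reason a capacity-limited 1B model can see little gain from this stage, consistent with the card's observation about GRPO versus the RS-SFT baseline.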
Good For
- Korean Mathematical Problem Solving: Ideal for applications requiring a compact model to perform arithmetic and reasoning tasks in Korean.
- Research into RLHF for Smaller Models: Provides insights into the limitations and effectiveness of advanced RL techniques like GRPO on 1B parameter models.