dipta007/GanitLLM-4B_SFT_CGRPO

Text generation · Concurrency cost: 1 · Model size: 4B · Quantization: BF16 · Context length: 32k · Published: Jan 1, 2026 · License: apache-2.0 · Architecture: Transformer · Open weights

dipta007/GanitLLM-4B_SFT_CGRPO is a 4-billion-parameter causal language model developed by dipta007, based on Qwen/Qwen3-4B, with a 4,096-token context length. It is fine-tuned specifically for Bengali mathematical reasoning using Supervised Fine-Tuning (SFT) followed by a novel Curriculum-GRPO approach. The model significantly improves accuracy on Bengali mathematical benchmarks while generating more concise, Bengali-centric solutions than its base model. Its primary strength is solving mathematical problems in Bengali with high accuracy and efficient output.


GanitLLM-4B_SFT_CGRPO: Bengali Mathematical Reasoning Model

GanitLLM-4B_SFT_CGRPO is a 4-billion-parameter causal language model built on the Qwen/Qwen3-4B architecture and optimized for mathematical reasoning in Bengali. Developed by dipta007, the model is trained with a multi-stage pipeline combining Supervised Fine-Tuning (SFT) and a novel Curriculum-GRPO (curriculum-guided Group Relative Policy Optimization) approach.

Key Capabilities & Performance

  • Enhanced Bengali Mathematical Reasoning: Achieves significant improvements on Bengali mathematical benchmarks: +7.6 percentage points on Bn-MGSM (76.8% accuracy) and +5.9 percentage points on Bn-MSVAMP (76.4% accuracy) over the base Qwen3-4B model.
  • Bengali-Centric Reasoning: Produces solutions whose reasoning is 88.71% Bengali text, up sharply from the base model's 14.79%.
  • Concise Solutions: Generates solutions that are 79.5% shorter (193 words vs. 943 words) while maintaining high accuracy.
  • Context Length: Supports a context length of 4,096 tokens.
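The "Bengali-centric reasoning" figures above measure how much of a solution is written in Bengali script. The model card does not specify the exact measurement, but a simple character-level version of such a metric, assuming the share of letters falling in the Bengali Unicode block (U+0980–U+09FF) is counted, can be sketched as:

```python
def bengali_ratio(text: str) -> float:
    """Fraction of alphabetic characters that fall in the Bengali
    Unicode block (U+0980-U+09FF). Illustrative metric only; the
    model card does not state how its percentage was computed."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    bengali = sum(1 for ch in letters if "\u0980" <= ch <= "\u09ff")
    return bengali / len(letters)

# Mixed Bengali/English reasoning step scores between 0 and 1:
sample = "ধরি x = 5, then x + 3 = 8"
score = bengali_ratio(sample)
```

A word-level or sentence-level variant would behave similarly; the character-level form is the easiest to make deterministic.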

Training Methodology

The model was trained using:

  1. Supervised Fine-Tuning (SFT): Initial training on the GANIT-SFT dataset (~11k examples) to establish foundational reasoning in Bengali.
  2. Curriculum-GRPO: Subsequent reinforcement learning on the GANIT-RLVR dataset (~7.3k examples), using difficulty-aware sampling and reward functions for output format, answer correctness (accepting both Bengali and English answers), and a high proportion of Bengali text in the reasoning steps.
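The three reward signals described for Curriculum-GRPO can be sketched as below. This is an illustrative reconstruction, not the authors' implementation: the `\boxed{...}` answer format, the equal weighting, and the acceptance set for answers are all assumptions.

```python
import re

def format_reward(completion: str) -> float:
    # Reward completions that contain a \boxed{...} final answer,
    # a common convention in math-reasoning RL. (Assumed format.)
    return 1.0 if re.search(r"\\boxed\{[^{}]+\}", completion) else 0.0

def correctness_reward(completion: str, answers: set) -> float:
    # Accept the final answer written in either Bengali or Western
    # digits, per the card's note that both are rewarded.
    m = re.search(r"\\boxed\{([^{}]+)\}", completion)
    return 1.0 if m and m.group(1).strip() in answers else 0.0

def bengali_reward(completion: str) -> float:
    # Fraction of letters in the Bengali Unicode block (U+0980-U+09FF),
    # encouraging reasoning written mostly in Bengali.
    letters = [c for c in completion if c.isalpha()]
    if not letters:
        return 0.0
    return sum("\u0980" <= c <= "\u09ff" for c in letters) / len(letters)

def total_reward(completion: str, answers: set) -> float:
    # Equal weighting is an assumption; the actual mix is unspecified.
    return (format_reward(completion)
            + correctness_reward(completion, answers)
            + bengali_reward(completion)) / 3.0
```

In GRPO, rewards like these are computed per sampled completion and normalized within each group; the curriculum component would additionally bias which problems are sampled at each stage by difficulty.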

Use Cases

This model is ideal for applications requiring accurate and efficient mathematical problem-solving in Bengali, particularly where concise and culturally relevant explanations are valued. It can be integrated into educational tools, intelligent tutoring systems, or any platform needing robust Bengali mathematical reasoning capabilities.