odats/rl_nmt_2026_04_11_13_31 is a 1-billion-parameter instruction-tuned language model fine-tuned from google/gemma-3-1b-it. It was trained with the TRL framework using GRPO (Group Relative Policy Optimization), a method designed to strengthen mathematical reasoning, which makes it well suited to tasks that benefit from improved reasoning.
Overview
odats/rl_nmt_2026_04_11_13_31 is a 1-billion-parameter language model based on the google/gemma-3-1b-it architecture and fine-tuned with the TRL framework to improve its reasoning performance.
Key Capabilities
- Mathematical Reasoning: The primary differentiator of this model is its training with GRPO (Group Relative Policy Optimization), a reinforcement-learning method introduced in the DeepSeekMath paper, which indicates it is optimized for tasks requiring mathematical reasoning.
- Instruction Following: As a fine-tuned version of an instruction-tuned model, it is designed to follow user instructions effectively.
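The capabilities above can be exercised through the standard `transformers` text-generation pipeline. The sketch below is illustrative: the prompt, generation parameters, and the `build_messages` helper are assumptions, not part of this model card.

```python
def build_messages(question: str) -> list[dict]:
    # Chat format expected by instruction-tuned Gemma models:
    # a single user turn carrying the question.
    return [{"role": "user", "content": question}]

if __name__ == "__main__":
    # Heavy step kept under the main guard: downloads ~1B parameters on first run.
    from transformers import pipeline

    generator = pipeline("text-generation", model="odats/rl_nmt_2026_04_11_13_31")
    messages = build_messages(
        "A train travels 60 km in 45 minutes. What is its average speed in km/h?"
    )
    out = generator(messages, max_new_tokens=256)
    print(out[0]["generated_text"][-1]["content"])
```

Because the base model is instruction-tuned, plain chat-formatted prompts like the one above should work without any special prompt template.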
Training Details
The model's training procedure utilized GRPO, a technique aimed at pushing the limits of mathematical reasoning in open language models. The training was conducted using specific versions of key frameworks:
- TRL: 1.0.0
- Transformers: 4.57.6
- PyTorch: 2.10.0
- Datasets: 4.8.4
- Tokenizers: 0.22.2
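In TRL, GRPO training follows the `GRPOTrainer` pattern. The sketch below is hypothetical: the placeholder dataset and the toy `reward_numeric_answer` reward function are assumptions for illustration, not the actual recipe used to train this model.

```python
def reward_numeric_answer(completions, **kwargs):
    # Toy verifiable reward, a stand-in for checking a math answer:
    # +1.0 if the completion ends with a digit, else 0.0.
    return [
        1.0 if c.strip() and c.strip()[-1].isdigit() else 0.0
        for c in completions
    ]

if __name__ == "__main__":
    # Heavy step kept under the main guard: requires trl, datasets, and a GPU.
    from datasets import load_dataset
    from trl import GRPOConfig, GRPOTrainer

    train_dataset = load_dataset("trl-lib/tldr", split="train")  # placeholder dataset
    config = GRPOConfig(output_dir="rl_nmt_grpo", num_generations=8)
    trainer = GRPOTrainer(
        model="google/gemma-3-1b-it",   # the base model named in this card
        reward_funcs=reward_numeric_answer,
        args=config,
        train_dataset=train_dataset,
    )
    trainer.train()
```

GRPO samples a group of completions per prompt (`num_generations`) and computes advantages relative to the group's mean reward, which removes the need for a separate value model.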
Good For
- Applications requiring enhanced mathematical reasoning.
- General instruction-following tasks where a smaller, efficient model is preferred.