The odats/rl_nmt_2026_04_12_13_17 model is a 1 billion parameter language model, fine-tuned from google/gemma-3-1b-it. It was trained using the TRL framework and the GRPO method, which is designed to enhance mathematical reasoning. This model is particularly suited for tasks requiring improved reasoning capabilities, building upon its Gemma base.
Overview
This model, odats/rl_nmt_2026_04_12_13_17, is a 1 billion parameter language model derived from the google/gemma-3-1b-it architecture. It has been specifically fine-tuned using the TRL (Transformers Reinforcement Learning) framework.
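As a minimal usage sketch, the model can be loaded like any other causal LM on the Hub with the transformers library. The prompt below is illustrative; gated Gemma-derived weights may additionally require Hub authentication.

```python
MODEL_ID = "odats/rl_nmt_2026_04_12_13_17"

def load_model(model_id: str = MODEL_ID):
    """Load the tokenizer and model from the Hugging Face Hub."""
    # Import kept local so the sketch can be read without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
    return tokenizer, model

if __name__ == "__main__":
    tokenizer, model = load_model()
    # Gemma instruction-tuned checkpoints expect the chat template format.
    messages = [{"role": "user", "content": "What is 17 * 24? Show your reasoning."}]
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    )
    output = model.generate(inputs, max_new_tokens=256)
    print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```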
Key Training Details
The model's training incorporated the GRPO (Group Relative Policy Optimization) method. GRPO is a reinforcement learning technique introduced to improve mathematical reasoning in large language models, as detailed in the paper "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models". Its use here suggests a focus on enhancing the model's logical and problem-solving abilities.
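The core idea behind GRPO can be illustrated in a few lines: rather than training a separate value model as a baseline, it samples a group of completions per prompt and normalizes each completion's reward against the group's statistics. This is an illustrative sketch of that normalization step, not the TRL implementation.

```python
from statistics import mean, stdev

def group_relative_advantages(rewards, eps=1e-4):
    """Score each reward relative to its group's mean and standard deviation."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    # eps avoids division by zero when all rewards in the group are identical.
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four completions sampled for one prompt, scored by some reward function:
advs = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
# Above-average completions get positive advantage, below-average negative.
```

The policy is then updated to increase the likelihood of completions with positive group-relative advantage, which is what removes the need for a learned critic.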
Potential Use Cases
- Reasoning-intensive tasks: Given its training with GRPO, the model may perform well in scenarios requiring logical deduction or mathematical problem-solving.
- Building upon Gemma-3-1b-it: Developers familiar with the base Gemma model can leverage this fine-tuned version for improved performance in specific reasoning domains.
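For developers who want to run a similar fine-tune themselves, TRL ships a GRPOTrainer. The sketch below shows the general shape of such a run under stated assumptions: the dataset (gsm8k) and the exact-match reward function are illustrative stand-ins, not the actual training recipe behind this checkpoint.

```python
def exact_match_reward(completions, ground_truth, **kwargs):
    """Toy reward: 1.0 when a completion contains the reference answer."""
    return [1.0 if gt in c else 0.0 for c, gt in zip(completions, ground_truth)]

def main():
    # Imports kept local so the sketch can be read without TRL installed.
    from datasets import load_dataset
    from trl import GRPOConfig, GRPOTrainer

    dataset = load_dataset("openai/gsm8k", "main", split="train")  # assumed dataset
    args = GRPOConfig(output_dir="rl_nmt_grpo", num_generations=8)
    trainer = GRPOTrainer(
        model="google/gemma-3-1b-it",   # the base model named in this card
        reward_funcs=exact_match_reward,
        args=args,
        train_dataset=dataset,
    )
    trainer.train()

if __name__ == "__main__":
    main()
```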
Framework Versions
The training utilized TRL 1.1.0, Transformers 4.57.6, PyTorch 2.10.0, Datasets 4.8.4, and Tokenizers 0.22.2.