odats/rl_nmt_2026_04_11_13_31
Text generation · Concurrency cost: 1 · Model size: 1B · Quantization: BF16 · Context length: 32k · Published: Apr 11, 2026 · Architecture: Transformer

The odats/rl_nmt_2026_04_11_13_31 model is a 1-billion-parameter instruction-tuned language model, fine-tuned from google/gemma-3-1b-it. It was trained with the TRL framework using GRPO (Group Relative Policy Optimization), a method designed to improve mathematical reasoning. Building on its Gemma-3-1b-it base, the model is particularly suited to tasks that require stronger reasoning.


Overview

This model, odats/rl_nmt_2026_04_11_13_31, is a 1-billion-parameter language model derived from google/gemma-3-1b-it. It has been fine-tuned using the TRL framework to strengthen its mathematical reasoning.

Key Capabilities

  • Mathematical Reasoning: A primary differentiator of this model is its training with GRPO (Group Relative Policy Optimization), a method introduced in the DeepSeekMath paper. This suggests an optimization for tasks requiring mathematical reasoning.
  • Instruction Following: As a fine-tuned version of an instruction-tuned model, it is designed to follow user instructions effectively.
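The core idea behind GRPO is easy to sketch: for each prompt, a group of completions is sampled, each is scored with a reward, and rewards are normalized within the group to produce advantages, replacing the learned value function of PPO. The snippet below is an illustrative toy with hypothetical reward values, not this model's actual training code:

```python
# Toy sketch of GRPO's group-relative advantage computation.
# Rewards here are hypothetical; the real training loop samples
# completions from the policy and scores them with a reward function.
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize a group of rewards to zero mean and unit std dev."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled completions for one math prompt, scored 1.0 when the
# final answer was correct and 0.0 otherwise (a common reward scheme).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
# Correct completions get a positive advantage, incorrect a negative one.
```

Completions that beat their group's average are reinforced, so no separate critic model is needed.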

Training Details

The model's training procedure utilized GRPO, a technique aimed at pushing the limits of mathematical reasoning in open language models. The training was conducted using specific versions of key frameworks:

  • TRL: 1.0.0
  • Transformers: 4.57.6
  • PyTorch: 2.10.0
  • Datasets: 4.8.4
  • Tokenizers: 0.22.2
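GRPO training of this kind typically scores completions with a verifiable reward rather than a learned reward model. The exact reward used here is not documented; the function below is a hypothetical example of the pattern (an `exact_answer_reward` that checks the final number in a completion against a reference answer), named for illustration only:

```python
import re

def exact_answer_reward(completion: str, ground_truth: str) -> float:
    """Hypothetical math-reasoning reward: 1.0 if the last number in
    the completion matches the reference answer string, else 0.0."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
    if not numbers:
        return 0.0
    return 1.0 if numbers[-1] == ground_truth else 0.0

# Example: score two sampled completions against the answer "84".
r_good = exact_answer_reward("12 * 7 = 84, so the answer is 84.", "84")
r_bad = exact_answer_reward("I believe the answer is 72.", "84")
```

Such binary, automatically checkable rewards are what make GRPO practical for mathematical reasoning at this model scale.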

Good For

  • Applications requiring enhanced mathematical reasoning.
  • General instruction-following tasks where a smaller, efficient model is preferred.
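For the use cases above, prompts should follow the chat format of the google/gemma-3-1b-it base; in practice `tokenizer.apply_chat_template` handles this automatically. As a rough illustration only, assuming the base model's Gemma-style turn markers carry over, a manual formatter might look like:

```python
def format_gemma_prompt(user_message: str) -> str:
    """Sketch of a Gemma-style single-turn chat prompt (assumed from
    the google/gemma-3-1b-it base; prefer the tokenizer's
    apply_chat_template, which applies the correct format for you)."""
    return (
        "<start_of_turn>user\n"
        f"{user_message}<end_of_turn>\n"
        "<start_of_turn>model\n"
    )

prompt = format_gemma_prompt("What is 12 * 7? Show your reasoning.")
```

Generation would then continue from the trailing `model` turn marker.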