Kazuki1450/Llama-3.2-3B-Instruct_geo_3_6_clean_1p0_0p0_1p0_grpo_42_rule

Text Generation · Concurrency Cost: 1 · Model Size: 3.2B · Quant: BF16 · Ctx Length: 32k · Published: Mar 16, 2026 · Architecture: Transformer

Kazuki1450/Llama-3.2-3B-Instruct_geo_3_6_clean_1p0_0p0_1p0_grpo_42_rule is a 3.2 billion parameter instruction-tuned language model, fine-tuned from meta-llama/Llama-3.2-3B-Instruct. It was trained with GRPO, the reinforcement-learning method introduced in the DeepSeekMath paper for mathematical reasoning. The model is optimized for tasks that require enhanced reasoning, particularly in mathematical contexts, and supports a context length of 32768 tokens.


Model Overview

This model, developed by Kazuki1450, is an instruction-tuned variant of the meta-llama/Llama-3.2-3B-Instruct base model, featuring 3.2 billion parameters and a context length of 32768 tokens. It was fine-tuned using the TRL framework.

Key Differentiator: GRPO Training

A significant aspect of this model is its training methodology. It leverages GRPO (Group Relative Policy Optimization), a technique introduced in the research paper "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models" (arXiv:2402.03300). This suggests an optimization for tasks that benefit from advanced reasoning, particularly in mathematical domains.
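The core idea of GRPO can be illustrated with a short sketch. GRPO drops the learned value-function baseline of PPO: for each prompt it samples a group of completions, scores them with a reward (here, the card's name suggests a rule-based reward), and normalizes each reward against the group's mean and standard deviation. The function below is a minimal, illustrative implementation of that group-relative advantage step, not the exact code used to train this model; the `eps` term is an assumed numerical-stability constant.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages for one group of sampled completions.

    Instead of a learned value baseline, each completion's reward is
    normalized against the other completions sampled for the same
    prompt: A_i = (r_i - mean(r)) / (std(r) + eps).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: 4 sampled completions for one prompt, scored 1.0 if a
# rule-based checker accepts the answer and 0.0 otherwise.
rewards = [1.0, 0.0, 1.0, 0.0]
advantages = group_relative_advantages(rewards)
# Above-mean completions get positive advantage, below-mean negative.
```

These advantages then weight the token-level policy-gradient update, so completions that beat their group's average are reinforced and the rest are suppressed.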

Capabilities

  • Instruction Following: Designed to respond to user instructions effectively, building upon its Llama-3.2-3B-Instruct foundation.
  • Enhanced Reasoning: GRPO training targets improved performance on tasks requiring logical and mathematical reasoning.

Usage

This model is suitable for applications where a compact yet capable instruction-tuned model with a focus on reasoning is beneficial. Developers can integrate it using the Hugging Face transformers library for text generation tasks.
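A minimal usage sketch with the standard `transformers` text-generation pipeline follows. The repo id comes from this card; everything else (the example prompt, `max_new_tokens`, `device_map`) is a generic assumption, and running it downloads the model weights.

```python
import torch
from transformers import pipeline

model_id = "Kazuki1450/Llama-3.2-3B-Instruct_geo_3_6_clean_1p0_0p0_1p0_grpo_42_rule"

# The card lists BF16 weights, so bfloat16 is a natural dtype choice.
pipe = pipeline(
    "text-generation",
    model=model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "user", "content": "If 3x + 5 = 20, what is x?"},
]
out = pipe(messages, max_new_tokens=256)
print(out[0]["generated_text"][-1]["content"])
```

Since the model is instruction-tuned, passing a chat-style `messages` list lets the pipeline apply the Llama chat template automatically.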