swadeshb/Llama-3.2-3B-Instruct-AMPO-V1

3.2B parameters · BF16 · 32768-token context · Updated Dec 18, 2025

Model Overview

swadeshb/Llama-3.2-3B-Instruct-AMPO-V1 is an instruction-tuned language model built on Meta's Llama-3.2-3B-Instruct, with roughly 3.2 billion parameters and a 32768-token context length. It was developed by swadeshb and fine-tuned using the TRL library.
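A minimal inference sketch using the Transformers text-generation pipeline. The model ID comes from this card; the system prompt, generation settings, and helper names are illustrative assumptions, not part of the published card.

```python
MODEL_ID = "swadeshb/Llama-3.2-3B-Instruct-AMPO-V1"


def build_messages(question: str) -> list:
    """Chat-format prompt for an instruction-tuned Llama model.
    The system prompt here is an illustrative choice."""
    return [
        {"role": "system",
         "content": "You are a careful assistant. Reason step by step."},
        {"role": "user", "content": question},
    ]


def generate(question: str, max_new_tokens: int = 256) -> str:
    """Run one generation; requires `transformers` plus the model weights."""
    from transformers import pipeline  # heavy dependency, imported on use

    pipe = pipeline("text-generation", model=MODEL_ID, torch_dtype="bfloat16")
    out = pipe(build_messages(question), max_new_tokens=max_new_tokens)
    # The chat pipeline returns the full message list; the last entry
    # is the assistant's reply.
    return out[0]["generated_text"][-1]["content"]
```

Calling `generate("What is 17 * 24?")` downloads the weights on first use, so run it on a machine with enough memory for a 3B-parameter BF16 model.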

Key Differentiator

The primary distinction of this model lies in its training methodology. It was fine-tuned using GRPO (Group Relative Policy Optimization), a reinforcement-learning method introduced in the research paper "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models". This approach specifically aims to improve the model's capabilities in mathematical reasoning and complex problem-solving.

Training Details

  • Base Model: meta-llama/Llama-3.2-3B-Instruct
  • Fine-tuning Method: GRPO, as detailed in the DeepSeekMath paper.
  • Frameworks Used: TRL (version 0.23.0), Transformers (version 4.57.1), PyTorch (version 2.8.0+cu126), Datasets (version 3.3.2), Tokenizers (version 0.22.1).
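The card does not publish the actual training recipe, so the following is only a minimal sketch of what a GRPO run with TRL's `GRPOTrainer` looks like. The reward function, dataset choice (GSM8K), and config values are placeholder assumptions.

```python
def correctness_reward(completions, answer, **kwargs):
    """Toy reward: 1.0 if the reference answer appears in the completion,
    else 0.0. Real recipes typically parse and compare final answers."""
    rewards = []
    for completion, ref in zip(completions, answer):
        # Completions may be plain strings or chat-message lists.
        text = completion if isinstance(completion, str) else completion[-1]["content"]
        rewards.append(1.0 if str(ref) in text else 0.0)
    return rewards


def main():
    # Heavy dependencies imported on use; needs trl, datasets, and a GPU.
    from datasets import load_dataset
    from trl import GRPOConfig, GRPOTrainer

    dataset = load_dataset("openai/gsm8k", "main", split="train")
    config = GRPOConfig(output_dir="llama32-3b-grpo", num_generations=8)
    trainer = GRPOTrainer(
        model="meta-llama/Llama-3.2-3B-Instruct",
        reward_funcs=correctness_reward,
        args=config,
        train_dataset=dataset,
    )
    trainer.train()
```

TRL passes each dataset column to the reward function as a keyword argument, which is why `correctness_reward` can take `answer` directly alongside `completions`.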

Potential Use Cases

  • Mathematical Problem Solving: Ideal for applications requiring robust mathematical reasoning.
  • Logical Deduction: Suitable for tasks that benefit from enhanced logical processing.
  • Instruction Following: Benefits from its instruction-tuned base, making it responsive to user prompts.
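When evaluating a math-focused model like this one, a common pattern is to score completions by extracting a final numeric answer. The heuristic below (take the last number in the output) is a generic assumption, not a format the card specifies.

```python
import re


def extract_final_answer(completion: str):
    """Pull the last number from a completion: a simple heuristic for
    scoring math outputs when no structured answer format is enforced."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", completion.replace(",", ""))
    return matches[-1] if matches else None


def exact_match(completion: str, reference: str) -> bool:
    """Compare the extracted answer to a reference value numerically."""
    pred = extract_final_answer(completion)
    return pred is not None and float(pred) == float(reference)
```

Stripping commas first lets the heuristic handle outputs like "1,234"; a production evaluator would also want to handle fractions, units, and boxed answers.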