uc-rl/Qwen2.5-3B-UCRL

Hosted on Hugging Face

Text generation · Model size: 3.1B · Precision: BF16 · Context length: 32k · Published: Nov 7, 2025 · Architecture: Transformer

uc-rl/Qwen2.5-3B-UCRL is a 3.1 billion parameter causal language model, fine-tuned from Qwen/Qwen2.5-3B-Instruct. Developed by uc-rl, this model specializes in mathematical reasoning and problem-solving, leveraging the GRPO training method. It is optimized for verifiable coding problems and tasks requiring robust logical deduction, offering a 32768-token context length.


Overview

uc-rl/Qwen2.5-3B-UCRL is a 3.1 billion parameter language model, fine-tuned from the Qwen2.5-3B-Instruct base model. It has been specifically trained on the chansung/verifiable-coding-problems dataset to enhance its capabilities in mathematical reasoning and problem-solving.

Key Capabilities

  • Enhanced Mathematical Reasoning: The model was trained with GRPO (Group Relative Policy Optimization), the reinforcement learning method introduced in the DeepSeekMath paper, to improve its handling of complex mathematical and logical tasks.
  • Verifiable Coding Problem Solving: Fine-tuning on a dataset of verifiable coding problems makes it particularly adept at generating and understanding code-related solutions that can be programmatically checked.
  • Instruction Following: Inherits strong instruction-following capabilities from its Qwen2.5-3B-Instruct base.
  • Extended Context: Supports a context length of 32768 tokens, allowing for processing longer prompts and more complex problem descriptions.
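The "verifiable" in verifiable coding problems means a candidate solution can be checked programmatically and scored with a binary reward. A minimal sketch of such a checker is below; `verify_solution` and the `solve` entry-point convention are hypothetical illustrations, not the dataset's actual harness.

```python
def verify_solution(candidate_src: str, test_cases: list[tuple]) -> float:
    """Reward 1.0 only if the candidate passes every test case.

    Hypothetical reward function; the dataset's real harness may differ.
    """
    namespace: dict = {}
    try:
        exec(candidate_src, namespace)  # define the candidate's function
        solve = namespace["solve"]      # assumed entry-point name
        for args, expected in test_cases:
            if solve(*args) != expected:
                return 0.0
        return 1.0
    except Exception:
        return 0.0  # crashes or missing definitions earn no reward

# Example: a correct vs. an incorrect implementation of addition
tests = [((1, 2), 3), ((0, 0), 0)]
good = "def solve(a, b):\n    return a + b"
bad = "def solve(a, b):\n    return a - b"
```

Because the reward is computed by executing code rather than by a learned reward model, it cannot be gamed by superficially plausible answers.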

Training Details

The model was trained using the TRL (Transformer Reinforcement Learning) framework. The core training methodology, GRPO, is detailed in the paper "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models," which aims to improve mathematical reasoning in large language models.
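GRPO's central idea is to sample a group of completions per prompt and use the group's own reward statistics as the baseline, avoiding a separate value model. The sketch below shows only the group-relative advantage computation; the full GRPO objective in DeepSeekMath also includes a clipped policy ratio and a KL penalty, which are omitted here.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Normalize each completion's reward against its group's statistics.

    A_i = (r_i - mean(r)) / std(r) -- the group baseline used by GRPO,
    simplified from the DeepSeekMath formulation.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0:
        return [0.0 for _ in rewards]  # identical rewards carry no signal
    return [(r - mu) / sigma for r in rewards]

# Rewards for a group of 4 sampled completions of one prompt
# (e.g. binary pass/fail from a verifiable-coding reward)
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
# → [1.0, -1.0, -1.0, 1.0]
```

Completions that beat their group's average get a positive advantage and are reinforced; in TRL, this logic is encapsulated by the `GRPOTrainer` class.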