Mohith202/brainrl-grpo-single-m

Text Generation · Concurrency Cost: 1 · Model Size: 0.5B · Quant: BF16 · Ctx Length: 32k · Published: Apr 26, 2026 · Architecture: Transformer

Mohith202/brainrl-grpo-single-m is a 0.5-billion-parameter language model fine-tuned from Qwen/Qwen2.5-0.5B-Instruct. Developed by Mohith202, it was trained with GRPO (Group Relative Policy Optimization), the reinforcement learning method introduced in the DeepSeekMath paper. Fine-tuning was performed with the TRL framework, and the model supports a context length of 32768 tokens, making it suitable for mathematical reasoning and other multi-step problem-solving tasks.


Model Overview

Mohith202/brainrl-grpo-single-m is a 0.5-billion-parameter language model fine-tuned from the Qwen/Qwen2.5-0.5B-Instruct base model. It leverages GRPO (Group Relative Policy Optimization), a reinforcement learning method originally introduced in the research paper "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models" (arXiv:2402.03300). The fine-tuning process was conducted using the Hugging Face TRL (Transformer Reinforcement Learning) library.
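
The card does not document an inference setup; the snippet below is a minimal sketch, assuming the repository loads through the standard transformers text-generation pipeline and inherits the Qwen2.5 chat template from its base model:

```python
# Minimal inference sketch, assuming the model loads like its base model,
# Qwen/Qwen2.5-0.5B-Instruct, via the standard transformers pipeline.
import torch
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="Mohith202/brainrl-grpo-single-m",
    torch_dtype=torch.bfloat16,  # matches the BF16 precision listed above
)

messages = [
    {"role": "user", "content": "What is 17 * 24? Reason step by step."},
]
result = generator(messages, max_new_tokens=256)

# The pipeline returns the conversation with the assistant's reply appended.
print(result[0]["generated_text"][-1]["content"])
```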

Key Capabilities

  • Enhanced Reasoning: GRPO was developed to strengthen mathematical reasoning, so this fine-tune is likely optimized for tasks that demand more robust multi-step reasoning.
  • Instruction Following: As a fine-tuned version of an instruction-tuned model, it is designed to follow user instructions effectively.
  • Extended Context: With a context length of 32768 tokens, it can process and generate long sequences of text, which benefits complex queries and multi-turn conversations (see the sketch after this list).
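
As a quick illustration of that context budget, the sketch below tokenizes a chat-formatted prompt and checks it against the 32768-token window; it assumes the repository ships the tokenizer and chat template inherited from Qwen2.5:

```python
# Sketch: measure a chat prompt against the 32,768-token context window.
# Assumes the repo's tokenizer and chat template follow Qwen2.5 conventions.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Mohith202/brainrl-grpo-single-m")

messages = [
    {"role": "system", "content": "You are a careful math assistant."},
    {"role": "user", "content": "Prove that the sum of two even integers is even."},
]

# apply_chat_template with tokenize=True (the default) returns token ids.
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

CTX_LEN = 32768
print(f"Prompt uses {len(input_ids)} of {CTX_LEN} tokens")
assert len(input_ids) <= CTX_LEN, "prompt exceeds the model's context window"
```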

Training Details

The model was trained with the TRL framework (version 0.19.1) alongside Transformers (4.53.3), PyTorch (2.4.1+cu121), and Datasets (4.8.4). The distinguishing element of the training is the GRPO procedure, which, in line with the DeepSeekMath research it originates from, aims to improve performance on mathematical and logical reasoning.
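
The actual dataset and reward functions for this run are not documented on the card. For orientation, the following is a hypothetical sketch of what a GRPO run looks like with TRL's GRPOTrainer; the toy dataset and length-based reward below are placeholders, not the author's setup:

```python
# Hypothetical GRPO training sketch using TRL's GRPOTrainer. The dataset and
# reward function are illustrative placeholders; the model card does not
# document the ones actually used for brainrl-grpo-single-m.
from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer

# Toy prompt dataset; GRPOTrainer expects a "prompt" column.
train_dataset = Dataset.from_dict(
    {"prompt": ["What is 2 + 2?", "Factor 12 into primes."]}
)

# Illustrative reward: favor short, non-empty completions.
def reward_len(completions, **kwargs):
    return [1.0 if 0 < len(c) <= 200 else 0.0 for c in completions]

training_args = GRPOConfig(
    output_dir="brainrl-grpo-single-m",
    num_generations=4,  # completions sampled per prompt for the group baseline
    per_device_train_batch_size=4,
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",  # the stated base model
    reward_funcs=reward_len,
    args=training_args,
    train_dataset=train_dataset,
)
trainer.train()
```

GRPO scores each group of sampled completions with the reward functions and uses the group mean as a baseline, which is why `num_generations` must divide the effective batch size.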

Good for

  • Experimentation with GRPO-trained models.
  • Tasks requiring mathematical reasoning or logical problem-solving, given the training methodology.
  • Applications benefiting from a model with a substantial context window.