mjf-su/FaithfulnessGuidanceReward

VISION | Concurrency Cost: 1 | Model Size: 4B | Quant: BF16 | Ctx Length: 32k | Published: Apr 22, 2026 | Architecture: Transformer | Cold

The mjf-su/FaithfulnessGuidanceReward is a 4-billion-parameter language model fine-tuned from mjf-su/PhysicalAI-reason-VLA-MetaAction-1e using the TRL framework. It was trained with GRPO (Group Relative Policy Optimization), the reinforcement learning method introduced in the DeepSeekMath paper to improve mathematical reasoning in language models. With a context length of 32768 tokens, it is aimed at tasks that require faithful reasoning.

Model Overview

The mjf-su/FaithfulnessGuidanceReward is a 4-billion-parameter language model fine-tuned from the mjf-su/PhysicalAI-reason-VLA-MetaAction-1e base model. It was trained with the TRL (Transformer Reinforcement Learning) framework.

Key Capabilities & Training

This model's main differentiator is its training method: it was trained with GRPO (Group Relative Policy Optimization). GRPO was introduced in the paper "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models" (arXiv:2402.03300), where it is used to improve mathematical reasoning. For each prompt it samples a group of completions and scores each one relative to the rest of the group, which removes the need for a separate value model. Here the method is applied with the goal of more faithful responses. The training run was logged to Weights & Biases, where it can be visualized.
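The central idea in GRPO, as described in the DeepSeekMath paper, is that each sampled completion is scored relative to the other completions drawn for the same prompt. The following is a minimal sketch of that normalization step only, not this model's training code. The function name, the `eps` term, and the use of the population standard deviation are assumptions made for illustration; implementations differ on these details.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize rewards within one group of completions for a single prompt.

    Each advantage is (reward - group mean) / (group std + eps).
    `eps` is an assumed small constant that avoids division by zero
    when every completion in the group receives the same reward.
    """
    if not rewards:
        return []
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption for this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four completions for one prompt, scored by some reward function.
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

Completions that score above the group mean receive a positive advantage and are reinforced; those below the mean receive a negative one. When all rewards in a group are equal, every advantage is zero and the group contributes no learning signal.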

Technical Specifications

  • Parameters: 4 Billion
  • Context Length: 32768 tokens
  • Frameworks: TRL (0.26.1), Transformers (4.57.6), PyTorch (2.10.0), Datasets (4.4.1), Tokenizers (0.22.1)
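To reproduce the environment, it can help to compare the installed packages against the versions listed above. The sketch below uses only the standard library; the `EXPECTED` table and the `version_mismatches` helper are hypothetical names introduced here, and the PyPI distribution names (for example `torch` for PyTorch) are assumed.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions from the list above, keyed by assumed PyPI distribution name.
EXPECTED = {
    "trl": "0.26.1",
    "transformers": "4.57.6",
    "torch": "2.10.0",
    "datasets": "4.4.1",
    "tokenizers": "0.22.1",
}


def version_mismatches(expected=EXPECTED, lookup=version):
    """Return {package: (expected, installed_or_None)} for packages that differ."""
    mismatches = {}
    for name, want in expected.items():
        try:
            have = lookup(name)
        except PackageNotFoundError:
            have = None  # package is not installed
        # Ignore local build tags such as "+cu128" when comparing.
        if have is None or have.split("+")[0] != want:
            mismatches[name] = (want, have)
    return mismatches


if __name__ == "__main__":
    for name, (want, have) in version_mismatches().items():
        print(f"{name}: expected {want}, found {have or 'not installed'}")
```

An empty result means the installed versions match the listed ones. Newer versions may work as well; the check only reports differences.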

When to Use This Model

This model is particularly well-suited for applications where:

  • Faithful and accurate reasoning is paramount.
  • Tasks involve mathematical problem-solving or require logical consistency.
  • You need a model that has been trained with GRPO-based reinforcement learning with the aim of producing more reliable outputs.