mjf-su/ADEnReward-ReasoningConfidenceReward
mjf-su/ADEnReward-ReasoningConfidenceReward is a 4-billion-parameter language model fine-tuned from mjf-su/PhysicalAI-reason-VLA-MetaAction-1e. It was trained with GRPO (Group Relative Policy Optimization), a reinforcement learning method introduced to improve mathematical reasoning in language models. The model targets tasks that require robust reasoning and confidence assessment.
ADEnReward-ReasoningConfidenceReward Overview
This model, developed by mjf-su, is a 4-billion-parameter language model fine-tuned from the mjf-su/PhysicalAI-reason-VLA-MetaAction-1e base model. It was trained with GRPO (Group Relative Policy Optimization), the method introduced in the paper "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models" (arXiv:2402.03300), to improve its reasoning and confidence assessment. GRPO samples a group of completions for each prompt and scores every completion relative to the rest of its group, so no separate value model is needed.
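As a rough illustration of the group-relative idea behind GRPO, the sketch below normalizes the rewards of several completions sampled for one prompt against the group's mean and standard deviation. The function name, the reward values, and the choice of population standard deviation plus a small epsilon are assumptions made for illustration. This is not the training code or the reward function used for this model.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Score each completion relative to the other completions in its group.

    Hypothetical helper: subtracts the group mean and divides by the group
    standard deviation (population std here, an illustrative simplification).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every completion gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Made-up rewards for four completions sampled from a single prompt.
rewards = [1.0, 0.0, 0.5, 0.5]
advantages = group_relative_advantages(rewards)
```

Completions that score above the group mean receive a positive advantage and are reinforced; those below the mean receive a negative one. A completion that matches the mean gets an advantage of zero and contributes no policy-gradient signal.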
Key Capabilities
- Enhanced Reasoning: Trained with GRPO, a method introduced to improve mathematical reasoning in language models.
- Fine-tuned Performance: Builds upon the capabilities of its base model, mjf-su/PhysicalAI-reason-VLA-MetaAction-1e, with further optimization for specific reasoning tasks.
- Context Length: Supports a context length of 32768 tokens, allowing for processing longer inputs and maintaining coherence over extended interactions.
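To make the 32768-token figure concrete, here is a minimal sketch of budgeting generation length against the context window. It assumes the prompt and the generated tokens share the same window, as is typical for decoder-only models. The helper name and the example prompt size are hypothetical.

```python
CONTEXT_LENGTH = 32768  # context length stated for this model


def max_new_tokens_budget(prompt_tokens: int, context_length: int = CONTEXT_LENGTH) -> int:
    """Return how many tokens can still be generated after the prompt.

    Hypothetical helper; assumes prompt and output share one context window.
    """
    if prompt_tokens < 0:
        raise ValueError("prompt_tokens must be non-negative")
    # Clamp at zero: a prompt longer than the window leaves no room to generate.
    return max(context_length - prompt_tokens, 0)


# Example: a prompt that already uses 30000 tokens.
remaining = max_new_tokens_budget(prompt_tokens=30000)
```

In practice the prompt token count would come from the model's own tokenizer, since token counts differ between tokenizers.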
Training Details
The model was trained with the TRL (Transformer Reinforcement Learning) library. Framework versions: TRL 0.26.1, Transformers 4.57.6, and PyTorch 2.10.0. The training run can be visualized in Weights & Biases.
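When trying to reproduce the setup, it can help to compare the local environment with the versions listed above. The sketch below is one way to do that with the standard library. The distribution names (`trl`, `transformers`, `torch`) are assumed to be the usual PyPI names, and the `find_mismatches` helper is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions reported for training; the PyPI distribution names are assumed.
PINNED = {"trl": "0.26.1", "transformers": "4.57.6", "torch": "2.10.0"}


def find_mismatches(pinned, lookup=version):
    """Return {package: (expected, installed_or_None)} for packages that differ."""
    mismatches = {}
    for name, expected in pinned.items():
        try:
            installed = lookup(name)
        except PackageNotFoundError:
            installed = None  # package is not installed at all
        # Ignore local build suffixes such as "+cu128" when comparing.
        if installed is None or installed.split("+")[0] != expected:
            mismatches[name] = (expected, installed)
    return mismatches


if __name__ == "__main__":
    for name, (expected, installed) in find_mismatches(PINNED).items():
        print(f"{name}: expected {expected}, found {installed}")
```

The `lookup` parameter exists only so the comparison can be exercised without the packages installed. An exact version match is not necessarily required for inference; the check only flags differences from the training environment.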
Good For
- Applications requiring improved reasoning capabilities, particularly in areas where mathematical or logical inference is crucial.
- Scenarios where confidence assessment in model outputs is beneficial.
- Tasks that can leverage a large context window for complex problem-solving or detailed analysis.