LichengLiu03/Qwen2.5-3B-UFO
LichengLiu03/Qwen2.5-3B-UFO is a 3.1-billion-parameter language model based on Qwen2.5-3B-Instruct, fine-tuned with Proximal Policy Optimization (PPO) on the MetaMathQA dataset. It uses the Unary Feedback as Observation (UFO) framework to improve multi-turn mathematical reasoning by learning from minimal "Try Again" feedback. The model excels at revising its reasoning across multiple attempts, making it particularly effective for complex math, logic, and reasoning tasks that require iterative problem-solving.
Overview of Qwen2.5-3B-UFO
LichengLiu03/Qwen2.5-3B-UFO is a 3.1-billion-parameter model built on the Qwen2.5-3B-Instruct architecture. Its core innovation is the Unary Feedback as Observation (UFO) framework, which addresses the challenge of multi-turn reasoning in LLMs: models trained with conventional single-turn reinforcement learning often fail to incorporate feedback and repeat the same errors in interactive settings.
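For reference, the model can be loaded like any other Hugging Face causal language model. The snippet below is a minimal sketch, assuming the standard transformers API and the chat template inherited from Qwen2.5-3B-Instruct; it is not taken from the model's own documentation.

```python
# Minimal loading sketch (assumes the standard transformers API and the
# chat template inherited from Qwen2.5-3B-Instruct).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "LichengLiu03/Qwen2.5-3B-UFO"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

messages = [{"role": "user", "content": "What is 17 * 24?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(inputs, max_new_tokens=256)
# Decode only the newly generated tokens, not the prompt.
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```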
Key Capabilities & Differentiators
- Multi-Turn Reasoning: The UFO framework turns static datasets into multi-turn training episodes by treating minimal "Try Again" feedback as part of the observation, so the model learns from its earlier mistakes and revises its reasoning iteratively.
- Enhanced Mathematical Performance: Trained with PPO on the MetaMathQA dataset, it shows a 14% improvement in multi-turn success rates and a 10% reduction in average interaction turns for mathematical problems compared to single-turn baselines.
- Answer Diversity: The model achieves 90% non-repetitive answers, significantly higher than the 80% baseline, due to a repetition penalty in its reward design.
- Efficient Problem Solving: An exponential reward decay mechanism rewards solving problems in fewer turns, encouraging more efficient reasoning; a sketch of this reward design follows the list.
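The reward design described in the last two points can be summarized in a short sketch. Everything here, including the function name, the decay factor gamma, and the penalty weight, is an illustrative assumption about the shape of the reward, not the released training code:

```python
# Illustrative sketch of a UFO-style episode reward: exponential decay over
# turns plus a penalty for repeated answers. gamma, penalty, and all names
# are assumptions, not the released training code.
def episode_reward(answers, is_correct, gamma=0.5, penalty=0.1):
    reward = 0.0
    seen = set()
    for t, ans in enumerate(answers):   # t = 0 is the first attempt
        if ans in seen:
            reward -= penalty           # discourage repeating an earlier answer
        seen.add(ans)
        if is_correct(ans):
            reward += gamma ** t        # earlier solutions earn more reward
            break                       # episode ends on a correct answer
    return reward
```

Under this kind of shaping, a correct first attempt earns the full reward, later successes earn exponentially less, and verbatim retries are penalized, which is consistent with the reported gains in answer diversity and turn efficiency.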
Good For
- Mathematical Reasoning: Optimized for complex math problems, logical reasoning, and accurate calculation steps.
- Interactive Problem Solving: Ideal for applications where a model must iteratively refine its answers from simple negative feedback; see the inference-loop sketch at the end of this section.
- Learning from Sparse Feedback: Demonstrates effectiveness in scenarios where only minimal "Try Again" signals are available for improvement.
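To illustrate the interactive setting, the loop below feeds a bare "Try Again" message back into the dialogue after each incorrect attempt, mirroring the UFO observation format. It reuses `model` and `tokenizer` from the loading sketch above; `check_answer` is a hypothetical verifier your application would supply.

```python
# Hypothetical multi-turn refinement loop: on a wrong answer, append the
# model's attempt plus a bare "Try Again" message and regenerate.
# check_answer is a stand-in for whatever verifier your application has.
def solve_with_retries(problem, check_answer, max_turns=5):
    messages = [{"role": "user", "content": problem}]
    for _ in range(max_turns):
        inputs = tokenizer.apply_chat_template(
            messages, add_generation_prompt=True, return_tensors="pt"
        ).to(model.device)
        output = model.generate(inputs, max_new_tokens=512)
        answer = tokenizer.decode(
            output[0][inputs.shape[-1]:], skip_special_tokens=True
        )
        if check_answer(answer):
            return answer
        messages.append({"role": "assistant", "content": answer})
        messages.append({"role": "user", "content": "Try Again"})  # unary feedback as observation
    return None  # no correct answer within the turn budget
```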