Overview of Qwen2.5-3B-UFO
LichengLiu03/Qwen2.5-3B-UFO is a 3.1-billion-parameter model built on Qwen2.5-3B-Instruct. Its core innovation is the Unary Feedback as Observation (UFO) framework, which targets multi-turn reasoning in LLMs. Models trained with single-turn reinforcement learning often fail to incorporate feedback and repeat the same errors in interactive scenarios.
Key Capabilities & Differentiators
- Multi-Turn Reasoning: The UFO framework converts static single-turn datasets into multi-turn training episodes by treating minimal "Try Again" feedback as part of the observation, so the model learns from its earlier mistakes and revises its reasoning iteratively.
- Enhanced Mathematical Performance: Trained with PPO on the MetaMathQA dataset, it shows a 14% improvement in multi-turn success rates and a 10% reduction in average interaction turns for mathematical problems compared to single-turn baselines.
- Answer Diversity: The model achieves 90% non-repetitive answers, compared with 80% for the baseline, which is attributed to a repetition penalty in its reward design.
- Efficient Problem Solving: An exponential reward decay mechanism encourages solving problems in fewer turns, leading to more efficient reasoning.
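The reward ideas listed above (exponential decay over turns plus a repetition penalty) can be illustrated with a minimal sketch. The exact formula used to train the model is not given here, so the function name `episode_reward`, the `decay` and `repeat_penalty` values, and the way the two terms are combined are all assumptions chosen for illustration.

```python
def episode_reward(answers, correct, decay=0.5, repeat_penalty=0.1):
    """Illustrative reward for one multi-turn episode (assumed form, not the trained recipe).

    answers: the model's answer at each turn, in order.
    correct: the ground-truth answer.
    """
    attempts = []
    success_reward = 0.0
    for turn, answer in enumerate(answers):
        attempts.append(answer)
        if answer == correct:
            # Exponential decay: a success at turn 0 earns 1.0, turn 1 earns `decay`, and so on.
            success_reward = decay ** turn
            break
    # Repetition penalty: every attempt that duplicates an earlier attempt costs a fixed amount.
    repeats = len(attempts) - len(set(attempts))
    return success_reward - repeat_penalty * repeats


# Solving on the first turn scores higher than solving on the second.
first_try = episode_reward(["4"], "4")
second_try = episode_reward(["3", "4"], "4")
# Repeating a wrong answer before succeeding is penalized further.
with_repeat = episode_reward(["3", "3", "4"], "4")
```

Under this assumed form, later successes earn geometrically less reward, and repeating a previous answer lowers the total, which matches the stated goals of fewer turns and more diverse answers.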
Good For
- Mathematical Reasoning: Optimized for complex math problems, logical reasoning, and accurate calculation steps.
- Interactive Problem Solving: Ideal for applications where models need to iteratively refine answers based on simple negative feedback.
- Learning from Sparse Feedback: Demonstrates effectiveness in scenarios where only minimal "Try Again" signals are available for improvement.
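An interactive loop of the kind described above might look like the following sketch, in which each failed answer and a "Try Again" signal are appended to the observation before the next attempt. The prompt layout, the helper names `build_observation` and `solve_with_retries`, and the `generate` and `is_correct` callables are hypothetical; in practice `generate` would wrap a call to the model and `is_correct` would be an answer checker.

```python
FEEDBACK = "Try Again"  # the only feedback signal the model receives


def build_observation(question, failed_answers):
    """Fold earlier failed attempts and the unary feedback into one prompt (assumed layout)."""
    lines = [f"Question: {question}"]
    for answer in failed_answers:
        lines.append(f"Answer: {answer}")
        lines.append(f"Feedback: {FEEDBACK}")
    return "\n".join(lines)


def solve_with_retries(question, generate, is_correct, max_turns=5):
    """Query `generate` until `is_correct` accepts an answer or the turn budget runs out.

    Returns (answer, turns_used); answer is None if no attempt succeeded.
    """
    failed = []
    for _ in range(max_turns):
        answer = generate(build_observation(question, failed))
        if is_correct(answer):
            return answer, len(failed) + 1
        failed.append(answer)
    return None, max_turns


# Stand-in for the model: its answer changes with the number of "Try Again" lines it sees.
def stub_generate(observation):
    return str(3 + observation.count(FEEDBACK))


result = solve_with_retries("What is 2 + 2?", stub_generate, lambda a: a == "4")
```

With the stub, the first attempt fails, the observation gains one "Try Again" line, and the second attempt succeeds, so `result` holds the accepted answer and the number of turns used.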