SII-Enigma/Qwen2.5-7B-Ins-AMPO
The SII-Enigma/Qwen2.5-7B-Ins-AMPO is a 7.6 billion parameter instruction-tuned language model based on the Qwen2.5 architecture, developed by SII-Enigma. It utilizes a novel Adaptive Multi-Guidance Policy Optimization (AMPO) framework, which intelligently leverages diverse teacher models to enhance reasoning efficiency and learning effectiveness. This model is specifically designed to improve performance by intervening with external guidance only when the on-policy model fails, making it suitable for complex reasoning tasks.
Loading preview...
Overview
SII-Enigma/Qwen2.5-7B-Ins-AMPO is a 7.6 billion parameter instruction-tuned model built upon the Qwen2.5 architecture, developed by SII-Enigma. It introduces the Adaptive Multi-Guidance Policy Optimization (AMPO) framework, a novel approach that enhances model performance and efficiency by strategically integrating knowledge from multiple teacher models. Unlike traditional methods, AMPO intervenes with external guidance only when the on-policy model encounters difficulties, preserving the model's self-discovery capabilities while boosting reasoning.
Key Capabilities
- Adaptive Multi-Guidance Replacement: Minimizes external intervention, providing guidance only upon complete on-policy failure to maintain self-discovery and improve reasoning efficiency.
- Comprehension-based Guidance Selection: Optimizes learning by guiding the model to assimilate the most comprehensible external solutions, leading to demonstrably improved performance.
- Superior Performance: Achieves enhanced performance and efficiency compared to models trained solely with Reinforcement Learning (RL) or Supervised Fine-Tuning (SFT).
- Multi-Guidance Pool: Leverages a diverse set of teacher models, including AceReason-Nemotron-1.1-7B, DeepSeek-R1-Distill-Qwen-7B, OpenR1-Qwen-7B, and Qwen3-8B(thinking), to provide robust external knowledge.
Good For
- Complex Reasoning Tasks: Excels in scenarios requiring intricate problem-solving and logical deduction, benefiting from its adaptive guidance mechanism.
- Efficiency-focused Applications: Offers improved efficiency by selectively applying external knowledge, reducing unnecessary computational overhead.
- Research and Development: Provides a strong foundation for further exploration into multi-teacher learning and adaptive policy optimization techniques, as detailed in its associated paper.