SII-Enigma/Llama3.2-8B-Ins-AMPO: Adaptive Multi-Guidance Policy Optimization
This model, developed by SII-Enigma, is an 8-billion-parameter instruction-tuned variant of the Llama 3.2 architecture with a 32,768-token context window. Its core innovation is the Adaptive Multi-Guidance Policy Optimization (AMPO) framework, which adaptively integrates guidance from multiple diverse teacher models during training.
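Since this is an instruction-tuned causal language model, it can presumably be run with the standard Hugging Face Transformers API. The sketch below is an assumption-laden minimal example: the model id comes from this card, but the dtype, generation settings, and sample prompt are illustrative choices, not documented defaults.

```python
MODEL_ID = "SII-Enigma/Llama3.2-8B-Ins-AMPO"

def build_messages(question: str) -> list[dict]:
    """Wrap a user question in the chat format expected by
    instruction-tuned Llama models."""
    return [{"role": "user", "content": question}]

def main() -> None:
    # Heavy imports kept inside main() so the helpers above stay importable
    # without torch/transformers installed.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        torch_dtype=torch.bfloat16,  # assumes a GPU with ~16 GB+ memory
        device_map="auto",
    )
    prompt = tokenizer.apply_chat_template(
        build_messages("If 3x + 7 = 22, what is x?"),
        tokenize=False,
        add_generation_prompt=True,
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.6)
    # Decode only the newly generated tokens, not the echoed prompt.
    print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))

if __name__ == "__main__":
    main()
```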
Key Capabilities & Innovations
- Adaptive Multi-Guidance Replacement: AMPO minimizes intervention, providing external guidance only when the on-policy model fails completely. This preserves the model's capacity for self-discovery while improving reasoning efficiency.
- Comprehension-based Guidance Selection: when guidance is needed, the model is steered toward the external solution it can best comprehend, improving how effectively it assimilates teacher knowledge.
- Superior Performance: AMPO-trained models achieve better overall performance and efficiency than models trained with Reinforcement Learning (RL) or Supervised Fine-Tuning (SFT) alone.
- Multi-Guidance Pool: It leverages a diverse set of teacher models, including AceReason-Nemotron-1.1-7B, DeepSeek-R1-Distill-Qwen-7B, OpenR1-Qwen-7B, and Qwen3-8B (thinking), to provide robust external knowledge.
Use Cases
This model is particularly well-suited to tasks requiring multi-step reasoning and problem-solving. Because the external teacher knowledge is absorbed during training rather than required at inference time, it can produce more accurate and robust outputs in scenarios where traditional fine-tuning falls short, offering a more dynamic learning approach.