XiaoyuWen/MAGIC-Qwen2.5-7B-Instruct
XiaoyuWen/MAGIC-Qwen2.5-7B-Instruct is a 7.6 billion parameter instruction-tuned large language model developed by Xiaoyu Wen et al., based on the Qwen2.5-7B-Instruct architecture. This model is a 'defender' trained within the MAGIC co-evolving attacker-defender adversarial game framework, specifically designed to enhance robustness and safety against sophisticated harmful or policy-violating prompts. It excels at resisting jailbreak attempts and adaptive attacks while maintaining helpfulness, making it suitable for applications requiring high safety alignment.
Loading preview...
MAGIC-Qwen2.5-7B-Instruct: Robustness Through Adversarial Training
This model, developed by Xiaoyu Wen et al., is a 7.6 billion parameter instruction-tuned LLM built upon the Qwen2.5-7B-Instruct base. Its core differentiator is its training under the MAGIC (Co-Evolving Attacker-Defender Adversarial Game) framework. Unlike traditional safety alignment methods, MAGIC employs a dynamic game where an 'attacker' continuously generates increasingly complex harmful prompts, and a 'defender' (this model) iteratively adapts to resist these attacks.
Key Capabilities
- Enhanced Robustness: Significantly improved resistance against jailbreak attempts and policy-violating prompts.
- Adaptive Safety: Designed to generalize to unseen and evolving adversarial attacks through its co-evolutionary training process.
- Helpfulness Preservation: Maintains its utility and helpfulness while bolstering safety.
- Framework Innovation: Represents a novel approach to LLM safety alignment, moving beyond static red-teaming.
Good For
- Applications requiring high safety and robustness against adversarial prompting.
- Deployments where mitigating jailbreaks and harmful content generation is critical.
- Researchers and developers interested in advanced LLM safety alignment techniques.