Model Overview
thu-coai/Mistral-7B-Instruct-v0.2-safeunlearning is a 7-billion-parameter instruction-tuned language model derived from Mistral-7B-Instruct-v0.2. It was produced by thu-coai via a "safe unlearning" procedure that used only 100 raw harmful questions during training. The primary goal of this unlearning is to substantially improve the model's resilience against a variety of jailbreak attacks, making it a safer option for deployment in sensitive applications.
Key Capabilities
- Enhanced Safety: Demonstrates significantly improved resistance to jailbreak attempts compared to its base model.
- Performance Preservation: Maintains general performance levels comparable to the original Mistral-7B-Instruct-v0.2, ensuring its utility across a broad range of tasks.
- Instruction Following: Retains strong instruction-following capabilities, consistent with the base Mistral-7B-Instruct-v0.2.
- Standard Prompt Format: Utilizes the same prompt format as the original Mistral-7B-Instruct-v0.2, allowing for seamless integration into existing workflows.
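Because the model keeps the base model's prompt format, prompts can be built exactly as for Mistral-7B-Instruct-v0.2: user turns wrapped in `[INST] ... [/INST]`, assistant replies terminated with `</s>`. A minimal sketch of that formatting is below; the helper name `build_mistral_prompt` is illustrative, and in practice `tokenizer.apply_chat_template` from Hugging Face transformers produces the same layout.

```python
def build_mistral_prompt(turns):
    """Format a conversation in the Mistral-7B-Instruct-v0.2 style.

    `turns` is a list of (user_message, assistant_reply) pairs; the
    assistant reply of the final pair may be None when prompting for
    a new completion. User messages are wrapped in [INST] ... [/INST];
    completed assistant replies are closed with the </s> EOS token.
    """
    prompt = "<s>"
    for user, assistant in turns:
        prompt += f"[INST] {user} [/INST]"
        if assistant is not None:
            prompt += f" {assistant}</s>"
    return prompt


# Prompting for the model's first reply:
single = build_mistral_prompt([("How do I sort a list in Python?", None)])
# Continuing a multi-turn conversation:
multi = build_mistral_prompt([
    ("Hello", "Hi! How can I help?"),
    ("Tell me a joke.", None),
])
```

Because the format is unchanged, this drop-in compatibility is what allows the unlearned model to replace the base model in existing pipelines without prompt-side changes.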
Good For
- Applications requiring high safety: Ideal for chatbots, virtual assistants, and other conversational AI systems where mitigating harmful outputs and jailbreaks is critical.
- Research into model safety and unlearning: Provides a practical example of safe unlearning techniques applied to a large language model.
- General instruction-following tasks: Suitable for the broad range of natural language processing tasks where the base Mistral-7B-Instruct-v0.2 would be used, with improved safety behavior rather than formal safety guarantees.