thu-coai/Mistral-7B-Instruct-v0.2-safeunlearning
The thu-coai/Mistral-7B-Instruct-v0.2-safeunlearning model is a 7 billion parameter instruction-tuned language model, derived from Mistral-7B-Instruct-v0.2. Developed by thu-coai, it has undergone a safe unlearning process to enhance safety against jailbreak attacks while preserving general performance. This model is specifically optimized for applications requiring robust safety against harmful prompts, making it suitable for sensitive conversational AI. It maintains the original Mistral-7B-Instruct-v0.2 prompt format and a 4096-token context length.
Loading preview...
Model Overview
The thu-coai/Mistral-7B-Instruct-v0.2-safeunlearning is a 7 billion parameter instruction-tuned language model based on the original Mistral-7B-Instruct-v0.2. This version has been specifically modified by thu-coai through a "safe unlearning" process, targeting 100 raw harmful questions during its training. The primary goal of this unlearning was to significantly improve the model's resilience against various jailbreak attacks, making it a safer option for deployment in sensitive applications.
Key Capabilities
- Enhanced Safety: Demonstrates significantly improved resistance to jailbreak attempts compared to its base model.
- Performance Preservation: Maintains general performance levels comparable to the original Mistral-7B-Instruct-v0.2, ensuring its utility across a broad range of tasks.
- Instruction Following: Retains strong instruction-following capabilities, consistent with the base Mistral-7B-Instruct-v0.2.
- Standard Prompt Format: Utilizes the same prompt format as the original Mistral-7B-Instruct-v0.2, allowing for seamless integration into existing workflows.
Good For
- Applications requiring high safety: Ideal for chatbots, virtual assistants, and other conversational AI systems where mitigating harmful outputs and jailbreaks is critical.
- Research into model safety and unlearning: Provides a practical example of safe unlearning techniques applied to a large language model.
- General instruction-following tasks: Suitable for a wide array of natural language processing tasks where the base Mistral-7B-Instruct-v0.2 would be used, but with added safety guarantees.