Zephyr RMU: Unlearning Hazardous Knowledge
cais/Zephyr_RMU is a 7-billion-parameter language model derived from zephyr-7B-beta. Its distinguishing feature is the application of Representation Misdirection for Unlearning (RMU), a technique that degrades the model's ability to recall and generate hazardous knowledge, particularly in biosecurity and cybersecurity, while preserving general capability.
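The model loads like any other Hugging Face checkpoint. A minimal usage sketch, assuming it ships the same chat template as zephyr-7B-beta:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "cais/Zephyr_RMU"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Standard chat-style generation; the prompt here is only an example.
messages = [{"role": "user", "content": "Summarize the idea of machine unlearning."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```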
Key Capabilities & Features
- Hazardous Content Reduction: Substantially unlearns knowledge related to biosecurity and cybersecurity, as demonstrated by sharply reduced accuracy on the WMDP (Weapons of Mass Destruction Proxy) benchmark.
- Safety-Oriented Fine-Tuning: Uses the RMU method to limit the model's potential for malicious use without drastically degrading general performance (see the sketch after this list).
- Competitive General Performance: Retains strong performance on general language-understanding tasks, with MMLU and MT-Bench scores close to those of the original Zephyr 7B model.
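Per the WMDP paper, RMU steers the model's internal activations on hazardous ("forget") text toward a fixed, scaled random vector while anchoring its activations on benign ("retain") text to those of the frozen original model, updating only a few layers. The sketch below illustrates that objective; the helper hidden_states and the layer, steering_coeff, and alpha values are illustrative placeholders, not the authors' exact training code:

```python
# Illustrative sketch of the RMU objective; not the authors' training code.
import torch
import torch.nn.functional as F

def hidden_states(model, input_ids, layer):
    """Residual-stream activations at one transformer layer (hypothetical helper)."""
    out = model(input_ids, output_hidden_states=True)
    return out.hidden_states[layer]

# A fixed random "control" direction, scaled by a steering coefficient.
# steering_coeff and hidden_size are assumptions for this sketch.
steering_coeff = 20.0
hidden_size = 4096
u = torch.rand(hidden_size)
control_vec = steering_coeff * u / u.norm()

def rmu_loss(updated, frozen, forget_ids, retain_ids, layer=7, alpha=100.0):
    # Forget term: push activations on hazardous text toward the control
    # direction, scrambling the representations that encode the knowledge.
    h_forget = hidden_states(updated, forget_ids, layer)
    forget_term = F.mse_loss(h_forget, control_vec.to(h_forget).expand_as(h_forget))

    # Retain term: keep activations on benign text close to those of the
    # frozen original model so general capability is preserved.
    h_retain = hidden_states(updated, retain_ids, layer)
    with torch.no_grad():
        h_retain_ref = hidden_states(frozen, retain_ids, layer)
    retain_term = F.mse_loss(h_retain, h_retain_ref)

    return forget_term + alpha * retain_term
```

Because the forget term corrupts the internal representations that encode the hazardous knowledge rather than merely suppressing outputs, RMU differs from refusal-style safety tuning.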
Performance Highlights
Evaluations on the WMDP, MMLU, and MT-Bench benchmarks quantify the trade-off between unlearning and retained capability; a sketch for reproducing the scores follows the list below:
- WMDP-Bio: accuracy drops from 63.7% (Zephyr 7B) to 31.2% (Zephyr RMU).
- WMDP-Cyber: accuracy drops from 44.0% (Zephyr 7B) to 28.2% (Zephyr RMU); both post-unlearning scores approach the 25% random-guess baseline for WMDP's four-choice questions.
- MMLU: largely maintained at 57.1% (vs. 58.1% for Zephyr 7B).
- MT-Bench: largely maintained at 7.10 (vs. 7.33 for Zephyr 7B).
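The WMDP and MMLU figures can be checked with EleutherAI's lm-evaluation-harness (MT-Bench requires its own LLM-judge pipeline). A sketch, assuming a harness version that ships the WMDP tasks under the names used here:

```python
import lm_eval

# Task names and the "acc,none" metric key reflect recent harness
# versions and may differ in older releases.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=cais/Zephyr_RMU,dtype=bfloat16",
    tasks=["wmdp_bio", "wmdp_cyber", "mmlu"],
    batch_size=8,
)
for task, metrics in results["results"].items():
    print(task, metrics.get("acc,none"))
```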
This model fits applications where suppressing the generation of sensitive or harmful information in specific domains is critical but robust general language capability is still required.