cais/Zephyr_RMU

Text generation · Concurrency cost: 1 · Model size: 7B · Quantization: FP8 · Context length: 4K · Published: Apr 16, 2024 · License: MIT · Architecture: Transformer · Open weights

cais/Zephyr_RMU is a 7-billion-parameter language model based on the Zephyr 7B architecture, fine-tuned using Representation Misdirection for Unlearning (RMU). The model is designed to reduce the generation of hazardous content related to biosecurity and cybersecurity, making it suitable for applications requiring enhanced safety and a reduced risk of malicious use. It maintains competitive performance on general language understanding tasks while showing significantly lower accuracy on the targeted hazardous topics.


Zephyr RMU: Unlearning Hazardous Knowledge

cais/Zephyr_RMU is a 7 billion parameter language model derived from the zephyr-7B-beta base model. Its primary distinction lies in the application of Representation Misdirection for Unlearning (RMU), a technique aimed at reducing the model's ability to generate hazardous content, particularly concerning biosecurity and cybersecurity.
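To make the RMU idea concrete, here is a minimal sketch of its training objective: activations of a chosen intermediate layer on forget-set (hazardous) inputs are pushed toward a fixed random control direction, while activations on retain-set inputs are anchored to those of the frozen base model. The layer choice, steering coefficient `c`, and retain weight `alpha` below are illustrative toy values, not the settings used to train this model, and the NumPy tensors stand in for real hidden states.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model = 16  # toy hidden size; real models use thousands of dimensions
u = rng.normal(size=d_model)
u /= np.linalg.norm(u)  # fixed random unit control vector
c = 6.0       # steering coefficient (illustrative)
alpha = 100.0 # retain-loss weight (illustrative)

def rmu_loss(h_forget_updated, h_retain_updated, h_retain_frozen):
    """RMU objective sketch.

    forget term: drive updated-model activations on hazardous inputs
                 toward the scaled random direction c * u.
    retain term: keep updated-model activations on benign inputs close
                 to the frozen base model's activations.
    """
    forget_term = np.mean((h_forget_updated - c * u) ** 2)
    retain_term = np.mean((h_retain_updated - h_retain_frozen) ** 2)
    return forget_term + alpha * retain_term

# Toy activations standing in for one layer's hidden states on a batch.
h_forget = rng.normal(size=(4, d_model))
h_retain = rng.normal(size=(4, d_model))
loss = rmu_loss(h_forget, h_retain, h_retain)  # retain term is zero here
```

In actual RMU training, only a few layers' weights are updated by gradient descent on this loss; the random direction `u` is held fixed so that the model's internal representations of hazardous topics are scrambled while everything else is preserved.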

Key Capabilities & Features

  • Hazardous Content Reduction: Significantly "unlearns" knowledge related to biosecurity and cybersecurity, as demonstrated by reduced accuracy on the WMDP benchmark.
  • Safety-Oriented Fine-tuning: Utilizes the RMU method to mitigate malicious use potential without drastically impacting general performance.
  • Competitive General Performance: Maintains strong performance on general language understanding tasks, with MMLU and MT-Bench scores comparable to the original Zephyr 7B model.

Performance Highlights

Evaluations on the WMDP, MMLU, and MT-Bench datasets showcase the model's effectiveness:

  • WMDP-Bio: Accuracy reduced from 63.7% (Zephyr 7B) to 31.2% (Zephyr RMU).
  • WMDP-Cyber: Accuracy reduced from 44.0% (Zephyr 7B) to 28.2% (Zephyr RMU).
  • MMLU: Maintained at 57.1% (compared to 58.1% for Zephyr 7B).
  • MT-Bench: Maintained at 7.10 (compared to 7.33 for Zephyr 7B).

This model is well suited to applications where suppressing the generation of hazardous information in these specific domains is critical but robust general language capabilities are still required.