RealSafe-R1-7B: Safety-Enhanced LLM
RealSafe-R1-7B is a 7.6 billion parameter language model developed by RealSafe, built upon the DeepSeek-R1-Distill-Qwen-7B architecture. Its primary differentiator is its enhanced safety awareness, specifically fine-tuned to improve robustness against malicious queries and jailbreak attacks. This model leverages supervised fine-tuning (SFT) on custom safety-focused datasets to achieve superior refusal mechanisms for adversarial prompts.
Key Capabilities
- Superior Safety: Achieves significantly higher refusal rates against jailbreak attacks (e.g., 99.78% on 'None' StrongReject, compared to 55.06% for the base model).
- Retained Reasoning: Maintains high-quality performance across diverse reasoning tasks, including common sense, logic, and mathematical problems, demonstrating minimal degradation compared to its base model.
- Harmful Content Refusal: Effectively detects and refuses prompts requesting assistance with unethical, illegal, or policy-violating activities, as showcased in case studies involving deceptive emails and illegal operations.
Ideal Use Cases
RealSafe-R1-7B is particularly well-suited for applications where safety and robustness against adversarial inputs are critical. This includes:
- Customer Service Bots: Ensuring responses remain ethical and compliant.
- Content Moderation: Aiding in the identification and refusal of harmful content generation.
- Secure AI Assistants: Providing a safer foundation for interactive AI systems that might encounter malicious user prompts.