Qwen/Qwen3-4B-SafeRL

Parameters: 4B · Tensor type: BF16 · Context length: 40960 · License: apache-2.0
Overview

Qwen3-4B-SafeRL: Safety-Aligned Language Model

Qwen3-4B-SafeRL is a 4-billion-parameter model from the Qwen3 family, designed for enhanced safety. It is a safety-aligned version of Qwen3-4B, trained with Reinforcement Learning (RL) using a reward signal from Qwen3Guard-Gen. The alignment process improves robustness against harmful or adversarial prompts without resorting to overly simplistic refusals, preserving a positive user experience.
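
The snippet below is a minimal inference sketch. It follows the usual Qwen3 chat-template convention, including the enable_thinking flag mentioned under Key Capabilities; the prompt and generation settings are illustrative, and exact usage should be checked against the official instructions for this model.

```python
# Minimal inference sketch; model ID is taken from this card's title.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-4B-SafeRL"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

messages = [{"role": "user", "content": "How can I recognize a phishing email?"}]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,  # set True to allow the hybrid thinking mode
)
inputs = tokenizer([text], return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=512)

# Decode only the newly generated tokens.
response = tokenizer.decode(
    output_ids[0][inputs.input_ids.shape[-1]:], skip_special_tokens=True
)
print(response)
```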

Key Capabilities

  • Enhanced Safety: Achieves significantly higher safety rates (e.g., 86.5% on Qwen3-235B and 98.1% on WildGuard in non-thinking mode) compared to its base model.
  • Hybrid Reward Optimization: Employs a hybrid reward function during RL that balances three objectives (a sketch follows this list):
    • Safety Maximization: Penalizes unsafe content generation.
    • Helpfulness Maximization: Rewards genuinely helpful responses.
    • Refusal Minimization: Applies a moderate penalty for unnecessary refusals.
  • Maintains Helpfulness: Safety alignment largely preserves helpfulness, with competitive performance on benchmarks such as ArenaHard-v2.
  • Thinking Modes: Retains the hybrid thinking modes of the base model, allowing more complex reasoning when thinking is enabled (toggled via the chat template, as in the snippet above).
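
The exact reward formulation used in training is not reproduced on this card; the sketch below only illustrates how the three objectives above could be combined into a single scalar reward. All weights and helper functions (guard_label, helpfulness_score, is_refusal) are hypothetical placeholders, not the actual Qwen3-4B-SafeRL training code.

```python
# Illustrative only: stubs stand in for the guard model and reward judges.
def guard_label(prompt: str, response: str) -> str:
    """Placeholder for a Qwen3Guard-Gen style safety judgment ('safe'/'unsafe')."""
    return "safe"

def helpfulness_score(prompt: str, response: str) -> float:
    """Placeholder helpfulness score in [0, 1]."""
    return 1.0

def is_refusal(response: str) -> bool:
    """Placeholder refusal detector."""
    return response.strip().lower().startswith("i can't")

def hybrid_reward(prompt: str, response: str,
                  w_safety: float = 1.0,
                  w_helpful: float = 1.0,
                  w_refusal: float = 0.3) -> float:
    reward = 0.0
    label = guard_label(prompt, response)

    # Safety maximization: penalize responses the guard model judges unsafe.
    if label == "unsafe":
        reward -= w_safety

    # Helpfulness maximization: reward genuinely helpful answers.
    reward += w_helpful * helpfulness_score(prompt, response)

    # Refusal minimization: moderate penalty when the model refuses a prompt
    # that the guard model did not flag as unsafe.
    if is_refusal(response) and label != "unsafe":
        reward -= w_refusal

    return reward
```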

Good For

  • Applications requiring strong safety guarantees against harmful content.
  • Conversational AI where maintaining helpfulness and minimizing unwarranted refusals are crucial.
  • Developers looking for a robust, safety-aligned base model for further fine-tuning in sensitive domains.