Overview
Qwen3-4B-SafeRL: Safety-Aligned Language Model
Qwen3-4B-SafeRL is a 4-billion-parameter model in the Qwen3 family, designed for enhanced safety. It is a safety-aligned version of Qwen3-4B, trained with Reinforcement Learning (RL) using a reward signal from Qwen3Guard-Gen. The alignment process focuses on improving robustness against harmful or adversarial prompts without falling back on overly simplistic refusals, preserving a positive user experience.
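For quick reference, below is a minimal generation sketch using the Hugging Face transformers API. The checkpoint id and the enable_thinking switch follow standard Qwen3 conventions and are assumptions here; verify them against the published model card.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-4B-SafeRL"  # assumed Hugging Face checkpoint id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "How do I secure my home Wi-Fi network?"}]

# Qwen3 chat templates expose an enable_thinking switch for the hybrid thinking modes.
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,  # set True to allow a reasoning trace before the answer
)

inputs = tokenizer([prompt], return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=512)

# Decode only the newly generated tokens.
response = tokenizer.decode(output_ids[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(response)
```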
Key Capabilities
- Enhanced Safety: Achieves significantly higher safety rates (e.g., 86.5% on Qwen3-235B and 98.1% on WildGuard in non-thinking mode) compared to its base model.
- Hybrid Reward Optimization: Employs a hybrid reward function during RL that balances three objectives (see the sketch after this list):
  - Safety Maximization: Penalizes unsafe content generation.
  - Helpfulness Maximization: Rewards genuinely helpful responses.
  - Refusal Minimization: Applies a moderate penalty for unnecessary refusals.
- Maintains Helpfulness: Remains competitive on helpfulness benchmarks such as ArenaHard-v2 despite the safety alignment.
- Thinking Modes: Retains the base model's hybrid thinking and non-thinking modes, allowing more complex reasoning when thinking is enabled.
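The exact reward weights and scoring models used during training are not specified here, so the sketch below is purely illustrative: it shows one way the three objectives could be combined into a single scalar reward. The function name, label set, and coefficients are assumptions, not the released training recipe.

```python
def hybrid_reward(
    response_safety: str,      # verdict from a guard model such as Qwen3Guard-Gen (label set assumed)
    helpfulness: float,        # helpfulness score in [0, 1] from a separate reward model (assumed scale)
    is_refusal: bool,          # whether the response refuses the request
    prompt_is_harmful: bool,   # whether the prompt itself is harmful
    w_safety: float = 1.0,     # illustrative weights, not the released values
    w_helpful: float = 1.0,
    w_refusal: float = 0.3,
) -> float:
    """Illustrative combination of the three objectives listed above."""
    reward = 0.0
    # Safety maximization: penalize unsafe content generation.
    if response_safety == "Unsafe":
        reward -= w_safety
    # Helpfulness maximization: reward genuinely helpful responses.
    reward += w_helpful * helpfulness
    # Refusal minimization: moderate penalty for refusing a benign request.
    if is_refusal and not prompt_is_harmful:
        reward -= w_refusal
    return reward
```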
Good For
- Applications requiring strong safety guarantees against harmful content.
- Conversational AI where maintaining helpfulness and minimizing unwarranted refusals are crucial.
- Developers looking for a robust, safety-aligned base model for further fine-tuning in sensitive domains.