ReasoningShield/ReasoningShield-3B

Text generation · Model size: 3.2B · Quant: BF16 · Context length: 32k · Published: May 20, 2025 · License: apache-2.0 · Architecture: Transformer

ReasoningShield/ReasoningShield-3B is a 3.2 billion parameter specialized safety moderation model developed by ReasoningShield. Built on Llama-3.2, it is designed to identify hidden risks within the intermediate reasoning steps of Large Reasoning Models (LRMs). The model excels at detecting harmful content concealed in reasoning traces, providing stepwise risk analysis, and ensuring robust safety alignment for LRMs.


ReasoningShield-3B: Specialized Safety Moderation for LRMs

ReasoningShield-3B is a 3.2 billion parameter model specifically engineered to moderate the intermediate reasoning steps (Chain-of-Thought) of Large Reasoning Models (LRMs). Developed by ReasoningShield, it aims to detect hidden harmful content that might be obscured within seemingly innocuous reasoning processes, thereby enhancing the safety alignment of LRMs.

Key Capabilities and Differentiators

  • SOTA CoT Moderation: Achieves over 91% average F1 score on open-source LRM traces, significantly outperforming LlamaGuard-4 by 36% and GPT-4o by 16% in CoT moderation tasks.
  • Robust Generalization: Demonstrates strong performance across diverse reasoning paradigms, cross-task scenarios, and unseen data distributions, despite being trained on a compact 7K-sample dataset.
  • Enhanced Explainability: Provides stepwise risk analysis, offering transparency into the moderation process and addressing the "black-box" nature of traditional moderation models.
  • Efficient Design: Built on a compact Llama-3.2 base, requiring low GPU memory (e.g., 2.3GB for the 1B version), making it suitable for cost-effective deployment.
  • Comprehensive Risk Categories: Trained to identify risks across Violence, Hate & Toxicity, Deception & Misinformation, Rights Violation, Sex, Child Abuse, CyberSecurity, Prohibited Items, Economic Harm, and Political Risks.
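To make the moderation workflow concrete, here is a minimal sketch of formatting a (Query, CoT) pair for stepwise risk analysis. The prompt template and label vocabulary are illustrative assumptions, not the official ones shipped with the model; the commented-out `transformers` calls show where actual inference with the published weights would go.

```python
# Sketch of moderating a (Query, CoT) pair with ReasoningShield-3B.
# NOTE: the prompt wording and "safe"/"unsafe" labels below are assumptions
# for illustration; consult the model card for the official chat template.

def build_moderation_prompt(query: str, cot_steps: list[str]) -> str:
    """Format a query and its reasoning trace into one moderation prompt,
    numbering each step so the model can return per-step risk labels."""
    steps = "\n".join(f"Step {i}: {s}" for i, s in enumerate(cot_steps, start=1))
    return (
        "Assess the safety of each reasoning step below.\n"
        f"Query: {query}\n"
        f"Reasoning:\n{steps}\n"
        "Return a risk label (safe / unsafe) per step with a brief rationale."
    )

prompt = build_moderation_prompt(
    "How do I pick a lock?",
    ["The user may be locked out of their own home.",
     "Detailed bypass instructions could enable burglary."],
)

# Actual inference (requires downloading the ~3.2B BF16 weights):
# from transformers import AutoModelForCausalLM, AutoTokenizer
# tok = AutoTokenizer.from_pretrained("ReasoningShield/ReasoningShield-3B")
# model = AutoModelForCausalLM.from_pretrained(
#     "ReasoningShield/ReasoningShield-3B", torch_dtype="bfloat16")
# out = model.generate(**tok(prompt, return_tensors="pt"), max_new_tokens=256)
# print(tok.decode(out[0], skip_special_tokens=True))
```

Numbering the steps in the prompt is what lets the moderator tie each risk label back to a specific point in the reasoning trace, which is the basis of the stepwise analysis described above.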

Training Details

The model undergoes a two-stage training process: initial full-parameter fine-tuning on 4,358 high-agreement samples for structured analysis, followed by Direct Preference Optimization (DPO) on 2,642 hard negative samples to refine performance and enhance generalization. The training dataset, ReasoningShield-Dataset, consists of (Query, CoT) pairs annotated with detailed risk categories and safety levels.
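A stage-two DPO preference record pairs one prompt with a preferred and a dispreferred analysis. The shape below is an illustrative sketch (field names follow the common prompt/chosen/rejected convention used by DPO training libraries), not the published dataset schema.

```python
# Illustrative shape of a DPO preference record for stage two: "chosen" is a
# correct stepwise risk analysis, "rejected" a flawed one from a hard
# negative sample. Field names and content are assumptions for illustration.

def make_dpo_record(query: str, cot: str, good: str, bad: str) -> dict:
    prompt = f"Query: {query}\nReasoning: {cot}\nAnalyze each step for risk."
    return {"prompt": prompt, "chosen": good, "rejected": bad}

record = make_dpo_record(
    "Explain how phishing emails work.",
    "Step 1: outline common lure tactics at a high level.",
    "Step 1 is safe: it describes tactics generally, for awareness.",
    "All steps are unsafe because the topic mentions phishing.",
)
```

Training on such pairs teaches the model to prefer calibrated, step-level judgments over blanket refusals or misses, which is what the hard negatives target.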

Performance

ReasoningShield-3B achieves the highest F1 scores across various benchmarks for CoT Moderation, including AIR, SALAD, BeaverTails, and Jailbreak datasets, for both open-source (OSS) and closed-source (CSS) LRM samples. It also generalizes well to traditional Answer Moderation, rivaling baselines trained on much larger datasets.
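The reported F1 scores follow the standard definition for binary safe/unsafe classification, with "unsafe" as the positive class: F1 = 2PR / (P + R). A minimal sketch with toy labels (not benchmark data):

```python
# Standard binary F1 with "unsafe" as the positive class.
# The gold/pred labels below are toy examples, not benchmark outputs.

def f1_score(gold: list[str], pred: list[str], positive: str = "unsafe") -> float:
    tp = sum(g == positive and p == positive for g, p in zip(gold, pred))
    fp = sum(g != positive and p == positive for g, p in zip(gold, pred))
    fn = sum(g == positive and p != positive for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

gold = ["unsafe", "safe", "unsafe", "safe"]
pred = ["unsafe", "safe", "safe", "safe"]
score = f1_score(gold, pred)  # one unsafe trace missed: P = 1.0, R = 0.5
```

Because F1 balances precision and recall on the unsafe class, a moderator cannot score well by either flagging everything or flagging nothing, which is why it is the headline metric here.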