ReasoningShield/ReasoningShield-1B is a 1-billion-parameter specialized safety moderation model from the ReasoningShield project, built on Llama-3.2-1B-Instruct. It is designed to identify hidden risks within the intermediate reasoning steps of Large Reasoning Models (LRMs), not just their final outputs. The model achieves state-of-the-art performance in CoT moderation, outperforming larger models such as LlamaGuard-4 and GPT-4o, while requiring little GPU memory for efficient deployment.
ReasoningShield-1B: Specialized Safety Moderation for LLMs
ReasoningShield-1B is a 1-billion-parameter safety moderation model specifically engineered to detect hidden risks within the intermediate reasoning steps (CoT) of Large Reasoning Models (LRMs). Unlike traditional moderation models that assess only final outputs, ReasoningShield provides stepwise risk analysis, making otherwise opaque reasoning traces auditable and its verdicts explainable.
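The idea of stepwise moderation can be pictured with a minimal sketch. All helper names and the toy keyword check below are hypothetical illustrations, not ReasoningShield's actual API or classifier: the point is that each reasoning step gets its own verdict, so a harmful intermediate step is flagged even when the final answer looks benign.

```python
# Sketch of stepwise CoT moderation (hypothetical helper names, not the
# ReasoningShield API). A trace is split into steps and each step is
# classified independently.

def split_trace(trace: str) -> list[str]:
    """Split a chain-of-thought trace into individual reasoning steps."""
    return [s.strip() for s in trace.split("\n") if s.strip()]

def moderate_step(step: str) -> str:
    """Stand-in classifier: a real deployment would query the 1B model.
    Here a toy keyword check flags a step, for illustration only."""
    risky_markers = ("synthesize", "bypass", "exploit")
    return "harmful" if any(m in step.lower() for m in risky_markers) else "safe"

def moderate_trace(trace: str) -> list[tuple[str, str]]:
    """Return (step, verdict) pairs -- the stepwise view that
    output-only moderators cannot provide."""
    return [(step, moderate_step(step)) for step in split_trace(trace)]

trace = (
    "First, restate the user's question.\n"
    "Next, consider how to bypass the content filter.\n"
    "Finally, refuse and explain why."
)
for step, verdict in moderate_trace(trace):
    print(f"[{verdict}] {step}")
```

Note how the second step is flagged even though the trace ends in a refusal; an output-only moderator would see nothing wrong.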
Key Capabilities:
- State-of-the-Art CoT Moderation: Achieves over 91% average F1 on open-source LRM traces, significantly outperforming LlamaGuard-4 (by 36%) and GPT-4o (by 16%).
- Robust Generalization: Demonstrates strong performance across varied reasoning paradigms, cross-task scenarios, and unseen data distributions, despite being trained on a compact 7K-sample dataset.
- Efficient Design: Built on a compact base model (Llama-3.2-1B-Instruct), it requires low GPU memory (e.g., 2.3GB for the 1B version), making it cost-effective for resource-constrained environments.
- Comprehensive Risk Detection: Identifies risks across categories including Violence, Hate & Toxicity, Deception & Misinformation, Rights Violation, Sex, Child Abuse, CyberSecurity, Prohibited Items, Economic Harm, and Political Risks.
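To make the taxonomy above concrete, here is a small validation sketch. The `label: category` output format and the `parse_verdict` helper are assumptions for illustration, not ReasoningShield's actual output schema; only the category names come from the list above.

```python
# Illustrative sketch (hypothetical output format, not ReasoningShield's
# actual schema): validate a moderation verdict against the risk taxonomy.

RISK_CATEGORIES = {
    "Violence", "Hate & Toxicity", "Deception & Misinformation",
    "Rights Violation", "Sex", "Child Abuse", "CyberSecurity",
    "Prohibited Items", "Economic Harm", "Political Risks",
}

def parse_verdict(raw: str) -> dict:
    """Parse an assumed 'label: category' verdict line, e.g.
    'harmful: CyberSecurity', into a structured record."""
    label, _, category = raw.partition(":")
    label, category = label.strip().lower(), category.strip()
    if label not in {"safe", "harmful"}:
        raise ValueError(f"unknown label: {label!r}")
    if label == "harmful" and category not in RISK_CATEGORIES:
        raise ValueError(f"unknown risk category: {category!r}")
    return {"label": label, "category": category or None}

print(parse_verdict("harmful: CyberSecurity"))
```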
Training Details:
The model undergoes a two-stage training process: initial full-parameter supervised fine-tuning on 4,358 samples with agreed-on (consensus) labels, followed by Direct Preference Optimization (DPO) on 2,642 hard-negative samples to refine performance and enhance generalization. This approach yields strong performance even on traditional answer moderation, rivaling baselines trained on much larger datasets.
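As a sketch of what the second stage optimizes, the standard DPO objective for one preference pair is shown below. This is the generic DPO loss, not ReasoningShield's exact training code, and the variable names are ours.

```python
import math

# Standard DPO loss for a single preference pair (a generic sketch of the
# stage-two objective, not ReasoningShield's training code).
#   pi_*  : sequence log-probs under the policy being trained
#   ref_* : sequence log-probs under the frozen reference (SFT checkpoint)
#   beta  : temperature controlling drift away from the reference

def dpo_loss(pi_chosen: float, pi_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """-log sigmoid(beta * ((pi_c - ref_c) - (pi_r - ref_r)))."""
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# When the policy prefers the chosen response more strongly than the
# reference does, the margin is positive and the loss drops below log(2).
print(dpo_loss(pi_chosen=-10.0, pi_rejected=-14.0,
               ref_chosen=-11.0, ref_rejected=-13.0))
```

Training on hard negatives pushes this margin apart exactly where the SFT model's verdicts were least reliable, which is what refines borderline cases.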