cs-552-2026-claude-bots/safety_model
The safety_model, developed by cs-552-2026-claude-bots, is a fine-tuned version of Qwen/Qwen3-1.7B designed for safe, ethical, and responsible multiple-choice question answering. It generates an explicit qualitative reasoning trace within <think>...</think> tags before giving its final answer in \boxed{A} format. The model was trained with a two-stage alignment pipeline, Supervised Fine-Tuning (SFT) followed by Reinforcement Learning with Verifiable Rewards (RLVR), and achieved 81% accuracy on unseen test subsets.
Overview
safety_model is a specialized language model developed by cs-552-2026-claude-bots, fine-tuned from Qwen/Qwen3-1.7B. Its primary purpose is to answer multiple-choice questions with a strong emphasis on safety, ethics, and responsibility. The model first writes an explicit thought process inside <think>...</think> tags, then gives its final answer as a boxed option letter such as \boxed{A}.
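As a rough illustration of that output layout, the sketch below splits a completion into its reasoning trace and answer letter. The function name `parse_output`, the `ParsedOutput` type, and the assumption that answers are single capital letters are illustrative choices, not part of the model's published tooling.

```python
import re
from typing import NamedTuple, Optional

# Assumed layout: one <think>...</think> block, optional whitespace,
# then \boxed{X} where X is a single capital letter (an assumption).
_OUTPUT_RE = re.compile(r"<think>(.*?)</think>\s*\\boxed\{([A-Z])\}", re.DOTALL)


class ParsedOutput(NamedTuple):
    reasoning: str  # text found between the think tags
    answer: str     # option letter found inside \boxed{}


def parse_output(text: str) -> Optional[ParsedOutput]:
    """Return the reasoning trace and answer letter, or None if the text is malformed."""
    match = _OUTPUT_RE.search(text)
    if match is None:
        return None
    return ParsedOutput(reasoning=match.group(1).strip(), answer=match.group(2))


# Hypothetical completion, written by hand for this example.
sample = "<think>Option B avoids sharing private data.</think>\\boxed{B}"
parsed = parse_output(sample)
print(parsed.reasoning)
print(parsed.answer)
```

Returning `None` for malformed text, rather than raising, lets a caller decide whether to retry, log, or count the response as incorrect.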
Key Capabilities
- Safe and Ethical Reasoning: Designed to provide responsible answers to multiple-choice questions.
- Structured Reasoning: Generates qualitative reasoning traces, making its decision-making process transparent.
- Precise Output Formatting: Enforces a strict output format (<think>...</think>\boxed{...}) for clarity and consistency.
- Robust Training: Utilizes a two-stage alignment pipeline:
  - Stage 1: Thinking Intervention (TI) via SFT: Taught to generate reasoning traces using a subset of STAR-41K.
  - Stage 2: Reinforcement Learning with Verifiable Rewards (RLVR): Enforces exact MCQ formatting and factual correctness using MMLU and SafetyBench datasets, employing style and correctness rewards.
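The style and correctness rewards used in Stage 2 are not specified in detail here, so the following is only a minimal sketch of what verifiable rewards of this kind might look like. The function names, the binary 0/1 scoring, and the `style_weight` parameter are all assumptions made for illustration, not the authors' actual reward definitions.

```python
import re

# Hypothetical strict format: the whole completion must be a single
# <think>...</think> block followed by \boxed{<letter>} and nothing else.
_STRICT_RE = re.compile(
    r"\s*<think>(?:(?!</?think>).)+</think>\s*\\boxed\{([A-Z])\}\s*",
    re.DOTALL,
)


def style_reward(completion: str) -> float:
    """1.0 when the completion matches the assumed strict format, else 0.0."""
    return 1.0 if _STRICT_RE.fullmatch(completion) else 0.0


def correctness_reward(completion: str, gold: str) -> float:
    """1.0 when a well-formed completion boxes the gold letter, else 0.0."""
    match = _STRICT_RE.fullmatch(completion)
    return 1.0 if match and match.group(1) == gold else 0.0


def total_reward(completion: str, gold: str, style_weight: float = 0.5) -> float:
    """Weighted sum of both rewards; the weight is an arbitrary illustrative value."""
    return style_weight * style_reward(completion) + correctness_reward(completion, gold)


# Hand-written completion for demonstration purposes.
good = "<think>Option A is the safe choice.</think>\n\\boxed{A}"
print(total_reward(good, "A"))
print(total_reward(good, "B"))
print(total_reward("The answer is A", "A"))
```

Both rewards can be checked mechanically against the text and the gold label, which is what makes them "verifiable". In this sketch a malformed completion earns nothing even if it names the right option, one possible way to push a policy toward exact formatting.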
Performance
- Achieved 81% accuracy on a strictly unseen test subset during Continuous Integration (CI) evaluation.
When to Use This Model
This model is suited to applications that need transparent multiple-choice question answering, particularly where safety, ethical considerations, and structured reasoning matter most. Its explicit reasoning traces can help with auditing and understanding the model's decisions.