cs-552-2026-Flash-McQueenS-and-TheKing/safety_model
The safety_model by cs-552-2026-Flash-McQueenS-and-TheKing is a 1.7 billion parameter Qwen3-based language model specifically fine-tuned for safety multiple-choice questions. It is optimized to provide direct, non-thinking answers with a one-sentence justification followed by the answer letter in a boxed format. This model excels at knowledge and norm-judgment tasks within safety benchmarks, making it suitable for research in AI safety evaluation.
Loading preview...
Overview
This model, developed by cs-552-2026-Flash-McQueenS-and-TheKing, is a supervised fine-tune of Qwen/Qwen3-1.7B designed for safety multiple-choice questions. It operates in a "non-thinking" mode, providing a concise, one-sentence justification followed by the answer letter in a \boxed{} format, without extensive reasoning blocks. The model's output contract ensures every answer ends with the option letter wrapped in \boxed{...}.
Key Capabilities
- Specialized Safety Evaluation: Fine-tuned on 3,250 English multiple-choice items across seven safety categories from SafetyBench (Zhang et al., 2024), including Unfairness & Bias, Ethics & Morality, and Physical Health.
- Direct Answering: Optimized for pass@1 benchmarks by directly emitting answers with a brief justification, avoiding lengthy reasoning that can be less effective for classification-style safety tasks.
- Robust Training: Utilizes LoRA fine-tuning, merged into a full checkpoint, with careful data processing including letter balancing, synthetic validation, and decontamination against the SafetyBench test split.
Good For
- Research in AI Safety: Intended as a research/coursework artifact for answering English safety multiple-choice questions in a specific
\boxed{<letter>}format. - Knowledge and Norm-Judgment Tasks: Excels in scenarios where safety questions primarily involve knowledge recall and ethical judgment rather than multi-step deduction.
Limitations
- Performance on items with more than 4 options is less certain due to training data distribution.
- Stronger on categories derived from public datasets (700 items each) compared to LLM-generated categories (150 items each).
- Not a deployable safety system; it is designed for fixed-format MCQ tasks and should not be used for content moderation or refusal systems.