cs-552-2026-Flash-McQueenS-and-TheKing/safety_model

Task: Text Generation
Concurrency Cost: 1
Model Size: 2B
Quant: BF16
Ctx Length: 32k
Tool Calling: Supported
Published: May 5, 2026
License: apache-2.0
Architecture: Transformer
Tags: Open Weights, Cold

The safety_model by cs-552-2026-Flash-McQueenS-and-TheKing is a 1.7-billion-parameter, Qwen3-based language model fine-tuned for safety multiple-choice questions. It is optimized to answer directly, without a thinking phase: one sentence of justification followed by the answer letter in a boxed format. It is strongest on knowledge and norm-judgment items in safety benchmarks, which makes it suitable for research in AI safety evaluation.

Overview

This model, developed by cs-552-2026-Flash-McQueenS-and-TheKing, is a supervised fine-tune of Qwen/Qwen3-1.7B designed for safety multiple-choice questions. It operates in "non-thinking" mode: each response is a concise, one-sentence justification followed by the answer letter, with no extended reasoning block. The output contract is that every answer ends with the option letter wrapped in \boxed{...}.
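Because the output contract fixes where the answer appears, scoring a response reduces to extracting the last \boxed{...} letter and comparing it with the gold label. The following is a minimal sketch of that step; the function names `extract_boxed_letter` and `accuracy` are hypothetical helpers, not part of the model's release, and the regular expression assumes a single letter inside the braces.

```python
import re
from typing import Optional, Sequence

# Matches \boxed{A}; captures one letter and tolerates spaces inside the braces.
_BOXED = re.compile(r"\\boxed\{\s*([A-Za-z])\s*\}")


def extract_boxed_letter(text: str) -> Optional[str]:
    """Return the letter of the last \\boxed{...} in text, or None if absent."""
    matches = _BOXED.findall(text)
    return matches[-1].upper() if matches else None


def accuracy(outputs: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of outputs whose boxed letter equals the gold letter.

    An output with no parsable \\boxed{...} counts as incorrect.
    """
    if len(outputs) != len(gold):
        raise ValueError("outputs and gold must have the same length")
    if not outputs:
        return 0.0
    correct = sum(
        extract_boxed_letter(out) == g.upper() for out, g in zip(outputs, gold)
    )
    return correct / len(outputs)
```

Taking the last match rather than the first is a design choice: if the justification happens to quote a boxed letter, the final one is still the answer under the contract described above.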

Key Capabilities

  • Specialized Safety Evaluation: Fine-tuned on 3,250 English multiple-choice items across seven safety categories from SafetyBench (Zhang et al., 2024), including Unfairness & Bias, Ethics & Morality, and Physical Health.
  • Direct Answering: Optimized for pass@1 benchmarks by directly emitting answers with a brief justification, avoiding lengthy reasoning that can be less effective for classification-style safety tasks.
  • Robust Training: Trained with LoRA and merged into a full checkpoint, with careful data processing that includes letter balancing, synthetic validation, and decontamination against the SafetyBench test split.
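The card does not say how letter balancing was implemented. One common approach, sketched below as an assumption rather than as the authors' method, is to reorder each item's options so that the correct answer rotates through the available positions, which keeps the model from learning a preference for one letter. The item layout (`options` list plus integer `answer` index) and the function name `balance_letters` are hypothetical.

```python
import random
from typing import Any, Dict, List


def balance_letters(items: List[Dict[str, Any]], seed: int = 0) -> List[Dict[str, Any]]:
    """Reorder options so the correct answer cycles through positions.

    Each item is assumed to look like {"options": [...], "answer": index}.
    The input list is not modified; new item dicts are returned.
    """
    rng = random.Random(seed)
    balanced = []
    for i, item in enumerate(items):
        options = list(item["options"])
        correct = options[item["answer"]]
        # Shuffle the wrong options so their order carries no signal either.
        distractors = [o for j, o in enumerate(options) if j != item["answer"]]
        rng.shuffle(distractors)
        # Round-robin target position, bounded by this item's option count.
        target = i % len(options)
        distractors.insert(target, correct)
        balanced.append({**item, "options": distractors, "answer": target})
    return balanced
```

With round-robin assignment the answer positions are exactly uniform whenever the number of items is a multiple of the option count; a random target per item would be uniform only in expectation.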

Good For

  • Research in AI Safety: Intended as a research/coursework artifact for answering English safety multiple-choice questions in a specific \boxed{<letter>} format.
  • Knowledge and Norm-Judgment Tasks: Excels in scenarios where safety questions primarily involve knowledge recall and ethical judgment rather than multi-step deduction.

Limitations

  • Performance on items with more than 4 options is less certain, because such items are underrepresented in the training data.
  • Stronger on categories derived from public datasets (700 items each) than on LLM-generated categories (150 items each).
  • Not a deployable safety system; it is designed for fixed-format MCQ tasks and should not be used for content moderation or refusal systems.
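The card gives the total (3,250 items), the number of categories (seven), and the two per-category sizes (700 and 150), but not how many categories fall in each group. Assuming every category has exactly one of those two sizes, the split can be recovered by a short consistency check; the constant and function names below are illustrative only.

```python
from typing import List, Tuple

TOTAL_ITEMS = 3250      # training items stated in the card
NUM_CATEGORIES = 7      # safety categories stated in the card
PUBLIC_SIZE = 700       # items per category derived from public datasets
GENERATED_SIZE = 150    # items per LLM-generated category


def solve_split() -> List[Tuple[int, int]]:
    """Return every (public, generated) category count that matches the totals."""
    solutions = []
    for public in range(NUM_CATEGORIES + 1):
        generated = NUM_CATEGORIES - public
        if public * PUBLIC_SIZE + generated * GENERATED_SIZE == TOTAL_ITEMS:
            solutions.append((public, generated))
    return solutions


print(solve_split())
```

Under that assumption the only consistent split is four public-dataset categories and three LLM-generated ones (4 × 700 + 3 × 150 = 3,250), so public-dataset categories supply 2,800 of the 3,250 training items.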