cs-552-2026-thinking-tokens/safety_model

Hugging Face
TEXT GENERATIONConcurrency Cost:1Model Size:2BQuant:BF16Ctx Length:32kPublished:May 18, 2026License:apache-2.0Architecture:Transformer Open Weights Warm

The cs-552-2026-thinking-tokens/safety_model is a 2 billion parameter language model, based on Qwen/Qwen3-1.7B, fine-tuned using LoRA SFT and DPO methods. It is specifically optimized for safety benchmarks, incorporating diverse safety-focused datasets like SafetyBench, BeaverTails, and HH-RLHF. This model excels at identifying and responding to safety-related prompts, making it suitable for content moderation and safety-critical applications.

Loading preview...

Model Overview

The cs-552-2026-thinking-tokens/safety_model is a 2 billion parameter language model built upon the Qwen/Qwen3-1.7B architecture. It has undergone LoRA-based Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) to enhance its capabilities in safety-related tasks.

Key Capabilities

  • Safety-Oriented Responses: Trained extensively on datasets like SafetyBench, BeaverTails, and HH-RLHF to generate safe and appropriate outputs.
  • Multiple-Choice and Free-Form Safety Evaluation: Supports both multiple-choice safety questions (ending with \boxed{<LETTER>}) and free-form safety judgments (ending with \boxed{Safe}).
  • Efficient Training: Utilizes LoRA for efficient fine-tuning, completing SFT in ~28 minutes and DPO in ~1 hour 35 minutes on an A100 40G GPU.
  • Performance on Safety Benchmarks: Achieves 70% pass@1 (greedy) on validation_samples/safety.jsonl and 68.0% on the held-out BeaverTails 30k_test set.

Limitations

  • Bias Regression: Exhibits a decrease in willingness to flag bias due to DPO data favoring polite, non-confrontational responses.
  • Offensiveness Blind Spot: Lacks specific training data for judging text offensiveness.
  • English-Centric DPO: While supporting Chinese parity, DPO data is primarily English, limiting growth in other languages.

Good For

  • Applications requiring robust safety filtering and content moderation.
  • Developing AI systems that need to adhere to strict safety guidelines.
  • Research into safety alignment and preference optimization techniques.