cs-552-2026-thinking-tokens/safety_model
Text Generation | Concurrency Cost: 1 | Model Size: 2B | Quant: BF16 | Ctx Length: 32k | Published: May 18, 2026 | License: apache-2.0 | Architecture: Transformer | Open Weights
The cs-552-2026-thinking-tokens/safety_model is a 2-billion-parameter language model, based on Qwen/Qwen3-1.7B and fine-tuned with LoRA SFT and DPO. It is optimized for safety benchmarks and trained on safety-focused datasets such as SafetyBench, BeaverTails, and HH-RLHF. The model is intended for identifying and responding to safety-related prompts, which makes it a candidate for content moderation and safety-critical applications.
Model Overview
The cs-552-2026-thinking-tokens/safety_model is a 2-billion-parameter language model built on Qwen/Qwen3-1.7B. It has undergone LoRA-based Supervised Fine-Tuning (SFT) followed by Direct Preference Optimization (DPO) to improve its performance on safety-related tasks.
Key Capabilities
- Safety-Oriented Responses: Trained extensively on datasets like SafetyBench, BeaverTails, and HH-RLHF to generate safe and appropriate outputs.
- Multiple-Choice and Free-Form Safety Evaluation: Supports both multiple-choice safety questions (ending with \boxed{<LETTER>}) and free-form safety judgments (ending with \boxed{Safe}).
- Efficient Training: Utilizes LoRA for efficient fine-tuning, completing SFT in ~28 minutes and DPO in ~1 hour 35 minutes on an A100 40G GPU.
- Performance on Safety Benchmarks: Achieves 70% pass@1 (greedy) on validation_samples/safety.jsonl and 68.0% on the held-out BeaverTails 30k_test set.
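The \boxed{...} answer convention and the pass@1 figures above suggest a simple scoring loop. The following is a minimal sketch, not the authors' evaluation script: the helper names extract_boxed and pass_at_1 are hypothetical, the rule of taking the last \boxed{...} in a completion is an assumption, and the sample completions are invented for illustration.

```python
import re
from typing import Optional, Sequence

# Matches \boxed{...} with no nested braces, e.g. \boxed{B} or \boxed{Safe}.
_BOXED = re.compile(r"\\boxed\{([^{}]*)\}")


def extract_boxed(completion: str) -> Optional[str]:
    """Return the content of the last \\boxed{...} in a completion, or None.

    Assumption: if the model emits several boxed answers, the final one counts.
    """
    matches = _BOXED.findall(completion)
    return matches[-1].strip() if matches else None


def pass_at_1(completions: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of single (e.g. greedy) completions whose boxed answer
    matches the reference, compared case-insensitively."""
    if len(completions) != len(references):
        raise ValueError("completions and references must have the same length")
    if not references:
        return 0.0
    correct = 0
    for completion, reference in zip(completions, references):
        answer = extract_boxed(completion)
        # A completion without any boxed answer is scored as incorrect.
        if answer is not None and answer.lower() == reference.strip().lower():
            correct += 1
    return correct / len(references)


# Invented completions for illustration only (not real model output).
completions = [
    "The answer is \\boxed{B}",
    "This looks fine. \\boxed{Safe}",
    "I am not sure.",
    "\\boxed{A} wait, \\boxed{C}",
]
references = ["B", "Safe", "A", "A"]
score = pass_at_1(completions, references)
```

With one greedy completion per prompt, pass@1 reduces to plain accuracy, which is why a single pass over the completions is enough here.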
Limitations
- Bias Regression: The model is less willing to flag bias, because the DPO data favors polite, non-confrontational responses.
- Offensiveness Blind Spot: Lacks specific training data for judging text offensiveness.
- English-Centric DPO: Although the model maintains parity on Chinese, the DPO data is primarily English, which limits gains in other languages.
Good For
- Applications requiring robust safety filtering and content moderation.
- Developing AI systems that need to adhere to strict safety guidelines.
- Research into safety alignment and preference optimization techniques.