Name: cs-552-2026-thinking-tokens/safety_model API
Brand: Featherless.ai
Price: 10.00 USD
Availability: InStock
Author: cs-552-2026-thinking-tokens

Model Overview

The cs-552-2026-thinking-tokens/safety_model is a language model of roughly 2 billion parameters, built on Qwen/Qwen3-1.7B. It was fine-tuned with LoRA-based Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) to improve its performance on safety-related tasks.

Key Capabilities

Safety-Oriented Responses: Trained on datasets such as SafetyBench, BeaverTails, and HH-RLHF to generate safe and appropriate outputs.
Multiple-Choice and Free-Form Safety Evaluation: Supports both multiple-choice safety questions (answers end with \boxed{<LETTER>}) and free-form safety judgments (answers end with a boxed verdict such as \boxed{Safe}).
Efficient Training: Uses LoRA for efficient fine-tuning, completing SFT in about 28 minutes and DPO in about 1 hour 35 minutes on an A100 40 GB GPU.
Performance on Safety Benchmarks: Achieves 70% pass@1 (greedy) on validation_samples/safety.jsonl and 68.0% on the held-out BeaverTails 30k_test set.
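
The \boxed{...} answer convention and the pass@1 (greedy) metric described above can be illustrated with a short Python sketch. It is not taken from the model's own evaluation code: the helper names extract_boxed and pass_at_1, the choice to read the last boxed span, and the case-insensitive comparison are all assumptions made for illustration.

```python
import re
from typing import List, Optional

# Matches \boxed{...} with a simple, non-nested payload.
_BOXED = re.compile(r"\\boxed\{([^{}]*)\}")


def extract_boxed(text: str) -> Optional[str]:
    # Hypothetical helper: return the payload of the LAST boxed span,
    # on the assumption that the final answer closes the response.
    matches = _BOXED.findall(text)
    return matches[-1].strip() if matches else None


def pass_at_1(outputs: List[str], gold: List[str]) -> float:
    # Hypothetical helper: fraction of single (greedy) outputs whose
    # boxed answer equals the gold label, compared case-insensitively.
    # An output with no boxed answer counts as incorrect.
    if len(outputs) != len(gold):
        raise ValueError("outputs and gold must have the same length")
    if not outputs:
        return 0.0
    correct = 0
    for out, label in zip(outputs, gold):
        answer = extract_boxed(out)
        if answer is not None and answer.casefold() == label.strip().casefold():
            correct += 1
    return correct / len(outputs)


if __name__ == "__main__":
    responses = [
        r"Option B describes the safest action. \boxed{B}",
        r"The reply gives no harmful instructions. \boxed{Safe}",
    ]
    labels = ["B", "Safe"]
    print(extract_boxed(responses[0]))
    print(pass_at_1(responses, labels))
```

With greedy decoding there is one output per prompt, so pass@1 reduces to plain accuracy over the boxed answers, which is what this sketch computes.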

Limitations

Bias Regression: Less willing to flag bias after DPO, because the DPO data favors polite, non-confrontational responses.
Offensiveness Blind Spot: Lacks specific training data for judging whether text is offensive.
English-Centric DPO: Chinese parity is supported, but the DPO data is primarily English, which limits improvement in other languages.

Good For

Applications requiring robust safety filtering and content moderation.
Developing AI systems that need to adhere to strict safety guidelines.
Research into safety alignment and preference optimization techniques.
