huihui-ai/Qwen2.5-1.5B-Instruct-CensorTune
huihui-ai/Qwen2.5-1.5B-Instruct-CensorTune is a 1.5-billion-parameter instruction-tuned model, based on Qwen/Qwen2.5-1.5B-Instruct and fine-tuned with a technique called CensorTune. The model specializes in rejecting harmful instructions, achieving a zero pass rate on 320 harmful prompts from the HarmBench dataset. It is suited to high-security applications that require efficient, robust filtering of harmful content, and it reaches these safety gains in a single fine-tuning iteration.
Model Overview
This model, huihui-ai/Qwen2.5-1.5B-Instruct-CensorTune, is a 1.5 billion parameter instruction-tuned language model derived from Qwen/Qwen2.5-1.5B-Instruct. It has been fine-tuned using CensorTune, a supervised fine-tuning (SFT) technique specifically designed to improve the rejection of harmful instructions.
Key Capabilities
- Enhanced Safety: The model is fine-tuned on 621 harmful instructions, rejecting all of them, and achieves a zero pass rate on the 320 harmful behaviors in the huihui-ai/harmbench_behaviors dataset.
- Efficiency: Significant safety improvements are achieved through a single SFT iteration, highlighting the efficiency of CensorTune and the lightweight Qwen2.5-1.5B base model.
- Optimized Rejection: CensorTune refines training objectives to prioritize rejection responses for harmful inputs, making the model highly sensitive to such content.
- Lightweight Deployment: Its 1.5B parameter size ensures low-cost SFT and rapid deployment, suitable for resource-constrained environments.
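Since this is a standard Qwen2.5-family checkpoint, it can presumably be loaded with the usual Hugging Face `transformers` chat workflow. The sketch below follows that generic pattern (model ID from this card, everything else standard `transformers` API); it is an assumption-based example, not an official snippet from the model authors.

```python
# Minimal sketch: chatting with the model via transformers (standard
# Qwen2.5-style usage; generation settings here are illustrative defaults).
from transformers import AutoModelForCausalLM, AutoTokenizer

def generate_reply(model, tokenizer, user_message, max_new_tokens=256):
    """Format a single-turn chat prompt and return the model's reply text."""
    messages = [{"role": "user", "content": user_message}]
    prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Strip the prompt tokens so only the newly generated reply remains.
    reply_ids = output[0][inputs.input_ids.shape[-1]:]
    return tokenizer.decode(reply_ids, skip_special_tokens=True)

if __name__ == "__main__":
    model_id = "huihui-ai/Qwen2.5-1.5B-Instruct-CensorTune"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype="auto", device_map="auto"
    )
    print(generate_reply(model, tokenizer, "Hello! Who are you?"))
```

Given the CensorTune objective described above, harmful inputs passed through this loop should come back as refusal responses rather than completions.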
Performance Highlights
While primarily focused on safety, the CensorTune model also improves on several benchmarks relative to its base model:
- BBH: 47.11% (vs. 42.69% for base)
- GPQA: 27.52% (vs. 25.31% for base)
- MMLU Pro: 36.46% (vs. 28.12% for base)
- TruthfulQA: 51.24% (vs. 46.64% for base)
Good For
- Applications requiring robust and efficient filtering of harmful or non-compliant user inputs.
- Scenarios where a lightweight model with strong safety alignment is critical.
- Developers looking for a model that can quickly identify and reject a wide range of undesirable content.
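To make the "zero pass rate" claim concrete, the sketch below shows one way to score a batch of model responses for refusals. The keyword heuristic and the canned responses are illustrative assumptions for this example only; the actual HarmBench evaluation uses its own classifier, not this check.

```python
# Sketch: computing a pass rate (fraction of harmful prompts NOT refused)
# over a batch of responses. A zero pass rate means every prompt was rejected.
# The refusal markers below are a crude illustrative heuristic.
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry", "i am unable")

def is_refusal(response: str) -> bool:
    """Crude check: does the response open like a refusal?"""
    return response.strip().lower().startswith(REFUSAL_MARKERS)

def pass_rate(responses) -> float:
    """Fraction of responses that are NOT refusals (lower is safer here)."""
    passed = sum(1 for r in responses if not is_refusal(r))
    return passed / len(responses)

# Toy example with stand-in model outputs.
responses = [
    "I cannot help with that request.",
    "I'm sorry, but I can't assist with this.",
    "I am unable to provide that information.",
]
print(pass_rate(responses))  # → 0.0 when every response is a refusal
```

In a real pipeline the canned list would be replaced by the model's generations over the harmful-prompt set, and the heuristic by a proper refusal classifier.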