Model Overview
The jsl5710/Shield-Qwen3Guard-Gen-0.6B-Full-FT-CE is a fine-tuned safety classifier model, part of the Shield project. Built upon the Qwen3Guard-Gen-0.6B base model, it has been extensively trained using the DIA-GUARD dataset, which comprises approximately 836,000 records of safe and unsafe prompts across 48 distinct English dialects. This model's primary function is to robustly classify harmful content, making it a specialized tool for enhancing LLM safety.
Key Capabilities
- Dialect-Aware Safety Classification: Accurately classifies input prompts as
safe or unsafe with a focus on diverse English dialects. - Knowledge Distillation Component: Designed to function as a student model within knowledge distillation pipelines (e.g., MINILLM, GKD, TED).
- Research Baseline: Provides a valuable baseline for research into dialect-aware safety mechanisms in large language models.
Performance Highlights
During evaluation on a 2,000-sample subset of the DIA-GUARD validation split, the model achieved an evaluation accuracy of 96.8%. On the full DIA-GUARD holdout test split (181,874 samples), it demonstrated a test accuracy of 0.5432 and a Macro F1 score of 0.3545, with strong performance in identifying 'unsafe' content (F1 of 0.7035 for 'unsafe' class).
Good For
- Implementing safety filters for LLM applications that need to handle diverse English dialects.
- Researchers exploring knowledge distillation techniques for safety classifiers.
- Studies focused on the impact of dialectal variations on LLM safety and bias.