HarmClassifier: An LLM Harmfulness Judge
Abel-24/HarmClassifier is a 7.6 billion parameter model developed by Abel-24, serving as a dedicated harmfulness classifier. It is a key component of the HarmMetric Eval benchmark, which systematically evaluates metrics and judges for LLM harmfulness assessment. The model is designed to objectively determine if a given LLM response to a prompt contains content that is unsafe, relevant, and useful, based on predefined criteria.
Key Capabilities
- Objective Harmfulness Evaluation: Classifies responses as 'Yes' or 'No' regarding harmful content, based on strict criteria rather than general ethics.
- Fine-grained Assessment: Utilizes a detailed prompt template that considers three core criteria: unsafe intent/impact, relevance to the prompt, and usefulness of the assistance provided.
- Probabilistic Output: Can return a probability score indicating the likelihood of a response being harmful.
- Benchmarking Tool: Developed as part of a comprehensive benchmark to improve the credibility and consistency of LLM safety assessments.
Use Cases
- LLM Safety Evaluation: Ideal for developers and researchers needing to assess the harmfulness of LLM outputs.
- Automated Content Moderation: Can be integrated into pipelines to flag potentially harmful generations from LLMs.
- Research on Harmfulness Metrics: Provides a robust baseline for comparing and developing new harmfulness detection methods.