HarmReporter: Structured Harm Analysis for LLM Safety
HarmReporter is an open, 8-billion-parameter language model developed by jl3676 and fine-tuned from meta-llama/Llama-3.1-8B-Instruct. Its core function is to generate a structured "harm tree" for a given prompt: a detailed analysis of the prompt's potential negative impacts.
Key Capabilities
- Structured Harm Tree Generation: For a given prompt, HarmReporter identifies:
  - Stakeholders: Individuals, groups, communities, or entities that may be impacted.
  - Harmful Actions: Categories of harmful actions that may affect each stakeholder.
  - Harmful Effects: Specific effects each action may cause, categorized by type.
  - Impact Metrics: Likelihood, severity, and immediacy for each harmful effect.
- Performance: Achieves state-of-the-art performance in prompt safety classification, outperforming WildGuard, Llama-Guard-3, and ShieldGemma on average F1 score across five public benchmarks.
- Interoperability: Can be combined with BenefitReporter to create a comprehensive harm-benefit tree, forming the basis of the SafetyReporter system.
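The stakeholder → action → effect hierarchy above can be pictured as a small nested data structure. The sketch below is illustrative only: the field names, score scales, and example values are assumptions for exposition, not HarmReporter's actual output schema.

```python
from dataclasses import dataclass, field

# Hypothetical representation of a harm tree; field names and the 0-1
# score scale are assumed for illustration, not taken from the model.

@dataclass
class HarmfulEffect:
    description: str
    effect_type: str   # category of the effect (e.g. "material", "psychological")
    likelihood: float  # assumed scale: 0.0 (unlikely) to 1.0 (near certain)
    severity: float
    immediacy: float

@dataclass
class HarmfulAction:
    category: str
    effects: list[HarmfulEffect] = field(default_factory=list)

@dataclass
class Stakeholder:
    name: str
    actions: list[HarmfulAction] = field(default_factory=list)

@dataclass
class HarmTree:
    prompt: str
    stakeholders: list[Stakeholder] = field(default_factory=list)

# Example tree for an illustrative prompt.
tree = HarmTree(
    prompt="How do I pick a lock?",
    stakeholders=[
        Stakeholder(
            name="property owners",
            actions=[
                HarmfulAction(
                    category="physical security breach",
                    effects=[
                        HarmfulEffect(
                            description="burglary enabled by the response",
                            effect_type="material",
                            likelihood=0.3,
                            severity=0.7,
                            immediacy=0.5,
                        )
                    ],
                )
            ],
        )
    ],
)
```

Each path from the root to a leaf reads as "this stakeholder may suffer this effect from this action," with the three impact metrics attached at the leaf.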
Intended Uses
- Harmfulness Analysis: Provides a detailed breakdown of potential harms when an AI model responds to a user prompt, aiding in identifying risks.
- Moderation Tool: When integrated with BenefitReporter and an aggregation algorithm, it can generate a harmfulness score for prompts, serving as an effective moderation mechanism.
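As a sense of how per-effect metrics could roll up into a single harmfulness score, here is one simple weighted-average aggregation. This is a minimal sketch under assumed 0-1 score scales and arbitrary weights; it is not the actual SafetyReporter aggregation algorithm, which is not specified here.

```python
# Illustrative aggregation only; not SafetyReporter's actual algorithm.
# Each effect is assumed to carry likelihood/severity/immediacy in [0, 1].

def harmfulness_score(effects, weights=(0.5, 0.4, 0.1)):
    """Average per-effect risk, where an effect's risk is a weighted
    combination of its likelihood, severity, and immediacy scores."""
    if not effects:
        return 0.0
    w_l, w_s, w_i = weights  # assumed weights summing to 1.0
    risks = [
        w_l * e["likelihood"] + w_s * e["severity"] + w_i * e["immediacy"]
        for e in effects
    ]
    return sum(risks) / len(risks)

# Example: one moderate-risk effect yields a mid-range score.
effects = [{"likelihood": 0.3, "severity": 0.7, "immediacy": 0.5}]
score = harmfulness_score(effects)
```

A moderation pipeline could then threshold this score to flag or block prompts; the choice of weights and threshold is a policy decision, not something the score itself determines.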
Limitations
While highly effective, HarmReporter may occasionally generate inaccurate harm features, and its aggregated harmfulness scores do not always yield correct moderation judgments. Users should account for these potential inaccuracies when relying on its output.