jl3676/HarmReporter

TEXT GENERATIONConcurrency Cost:1Model Size:8BQuant:FP8Ctx Length:32kPublished:Oct 16, 2024License:apache-2.0Architecture:Transformer0.0K Open Weights Cold

HarmReporter by jl3676 is an 8 billion parameter language model, fine-tuned from Llama-3.1-8B-Instruct, designed to generate structured "harm trees" for given prompts. It identifies stakeholders, harmful actions, effects, and their likelihood, severity, and immediacy. This model specializes in harmfulness analysis and moderation, outperforming other open-source baselines in prompt harmfulness classification benchmarks.

Loading preview...

HarmReporter: Structured Harm Analysis for LLM Safety

HarmReporter is an 8 billion parameter open language model developed by jl3676, fine-tuned from meta-llama/Llama-3.1-8B-Instruct. Its core function is to generate a structured "harm tree" for any given prompt, providing a detailed analysis of potential negative impacts.

Key Capabilities

  • Structured Harm Tree Generation: For a given prompt, HarmReporter identifies:
    • Stakeholders: Individuals, groups, communities, or entities that may be impacted.
    • Harmful Actions: Categories of harmful actions that may affect each stakeholder.
    • Harmful Effects: Specific effects each action may cause, categorized by type.
    • Impact Metrics: Likelihood, severity, and immediacy for each harmful effect.
  • Performance: Achieves state-of-the-art performance in prompt safety classification, outperforming WildGuard, Llama-Guard-3, and ShieldGemma on average F1 score across five public benchmarks.
  • Interoperability: Can be combined with BenefitReporter to create a comprehensive harm-benefit tree, forming the basis of the SafetyReporter system.

Intended Uses

  • Harmfulness Analysis: Provides a detailed breakdown of potential harms when an AI model responds to a user prompt, aiding in identifying risks.
  • Moderation Tool: When integrated with BenefitReporter and an aggregation algorithm, it can generate a harmfulness score for prompts, serving as an effective moderation mechanism.

Limitations

While highly effective, HarmReporter may occasionally generate inaccurate features, and its aggregated harmfulness scores might not always lead to correct judgments. Users should be aware of these potential inaccuracies.