rishiskhare/gemma-3-promptshield
rishiskhare/gemma-3-promptshield is a 0.3 billion parameter Gemma-3 model fine-tuned by rishiskhare, specifically designed for detecting prompt injection attacks. This model classifies inputs as either containing a prompt injection (1) or being safe (0), achieving a ROC AUC of 0.9652 and 89.89% accuracy on the PromptShield dataset. It is primarily intended for enhancing the security of LLM applications by filtering malicious inputs and for red teaming to analyze vulnerabilities.
Loading preview...
Model Overview
rishiskhare/gemma-3-promptshield is a specialized 0.3 billion parameter language model, fine-tuned from unsloth/gemma-3-270m-it using the Unsloth framework. Its core function is to identify and classify prompt injection attacks within user inputs, distinguishing between malicious and safe prompts.
Key Capabilities
- Prompt Injection Detection: Classifies inputs as '1' (injection detected) or '0' (safe).
- Security Filtering: Designed to improve the robustness and safety of large language model applications.
- Red Teaming: Useful for analyzing and identifying potential prompt injection vulnerabilities in LLM systems.
Performance Metrics
Evaluated on the hendzh/PromptShield test set (2,940 samples), the model demonstrates strong performance in identifying prompt injections:
- ROC AUC: 0.9652
- Accuracy: 89.89%
- F1 Score: 0.7990
Intended Use Cases
This model is ideal for developers and security researchers looking to integrate a dedicated prompt injection detection layer into their LLM-powered applications or for conducting security assessments.