The metricspace/GDPR_Input_Detection_and_Anonymization_0.5B is a 0.5 billion parameter model designed to act as a firewall or proxy for user inputs to external LLMs. It analyzes prompts to provide a complexity score, guiding the selection of appropriate LLMs, and a sensitivity score to detect and anonymize confidential information. This model specializes in local data protection and anonymization, supporting 29 languages with a context length of 131072 tokens.
Model Overview
Running locally as a firewall or proxy, the model inspects each prompt before it is forwarded to a larger, external AI model and returns two critical scores: complexity, which guides model selection, and sensitivity, which drives data protection.
Key Capabilities
- Complexity Scoring: Rates task complexity from 1 to 10, helping users select the most cost-effective and appropriate LLM (e.g., smaller models for low scores, powerful models like GPT-4o for high scores). This optimizes resource usage and reduces costs.
- Sensitivity Scoring: Assesses prompt confidentiality from 0 (public) to 3 (highly critical), enabling blocking or anonymization of sensitive data to prevent unauthorized exposure and ensure GDPR compliance.
- Anonymization and Re-Anonymization: Detects and replaces specific entities (e.g., locations, names, dates) based on configurable settings, allowing for secure processing by external LLMs and subsequent restoration of original entities.
- Multilingual Support: Trained with a mixture of English (80%) and multilingual (20%) examples, supporting 29 languages.
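Taken together, the two scores suggest a simple routing policy. The sketch below is illustrative only: the thresholds, model names, and action labels are assumptions for demonstration, not values prescribed by the model card.

```python
def route_prompt(complexity: int, sensitivity: int) -> str:
    """Map the model's two scores to a downstream action.

    complexity:  1 (trivial) .. 10 (hard)      -- picks the target LLM
    sensitivity: 0 (public)  .. 3 (critical)   -- picks the protection step
    Thresholds and model names here are hypothetical examples.
    """
    if sensitivity >= 3:
        # Highly critical content never leaves the local machine.
        return "block"
    # Confidential content (1-2) is anonymized before being sent out.
    action = "anonymize-then-send" if sensitivity >= 1 else "send"
    # Cheap model for simple tasks, a powerful model (e.g. GPT-4o) otherwise.
    target = "small-local-llm" if complexity <= 4 else "gpt-4o"
    return f"{action}:{target}"
```

For example, a simple public question would be routed straight to the small model, while a complex prompt containing personal data would be anonymized first and then sent to the more capable model.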
Good For
- Protecting sensitive data: Ideal for applications requiring local pre-processing of user inputs to remove or anonymize confidential information before sending to cloud-based LLMs.
- Optimizing LLM usage: Helps in dynamically selecting the right LLM based on task complexity, leading to cost savings and efficient resource allocation.
- GDPR compliance: Provides a mechanism to handle personal and confidential data in accordance with privacy regulations by preventing its direct exposure to external models.
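The anonymization round trip can be sketched as a placeholder substitution: detected entities are masked before the prompt leaves the machine, and the external LLM's response is restored afterwards. In this minimal sketch the entity list is supplied by hand; in practice it would come from the 0.5B model's entity detection output.

```python
def anonymize(text: str, entities: list[str]) -> tuple[str, dict[str, str]]:
    """Replace each detected entity with a numbered placeholder.

    Returns the masked text and a mapping needed to restore the
    original entities later (re-anonymization).
    """
    mapping: dict[str, str] = {}
    for i, entity in enumerate(entities):
        placeholder = f"[ENTITY_{i}]"
        mapping[placeholder] = entity
        text = text.replace(entity, placeholder)
    return text, mapping


def deanonymize(text: str, mapping: dict[str, str]) -> str:
    """Restore the original entities in the external LLM's response."""
    for placeholder, entity in mapping.items():
        text = text.replace(placeholder, entity)
    return text
```

A masked prompt like `"[ENTITY_0] lives in [ENTITY_1]."` can be processed by the external model, and the placeholders in its answer are swapped back locally, so the cloud provider never sees the original names or locations.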
Limitations
For complexity and sensitivity scoring, the model processes inputs up to 2,048 tokens. For entity detection, the combined input and output limit is 3,000 tokens. Exceeding these limits may lead to truncated outputs or inconsistent behavior.
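A pre-flight check against these limits can avoid truncated or inconsistent outputs. The guard below is a sketch: the token counts would normally come from the model's own tokenizer, and the output budget for entity detection is an assumed parameter.

```python
SCORING_LIMIT = 2048   # max input tokens for complexity/sensitivity scoring
ENTITY_LIMIT = 3000    # combined input + output tokens for entity detection


def fits_limits(n_input_tokens: int, task: str, n_output_budget: int = 0) -> bool:
    """Return True if a request fits the model's documented token limits.

    task: "scoring" (complexity/sensitivity) or "entity" (entity detection).
    For entity detection the limit covers input AND output combined, so an
    expected output budget must be reserved up front.
    """
    if task == "scoring":
        return n_input_tokens <= SCORING_LIMIT
    return n_input_tokens + n_output_budget <= ENTITY_LIMIT
```

Requests that fail the check can be rejected or split before they reach the model, rather than risking silently truncated results.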