naazimsnh02/qwen3-0.6b-pii-detector
The naazimsnh02/qwen3-0.6b-pii-detector is a 0.8 billion parameter Qwen3-based model fine-tuned for detecting Personally Identifiable Information (PII) and Protected Health Information (PHI) in text. It specializes in Named Entity Recognition (NER) for over 55 entity types, outputting results in an inline tagging format. This model is optimized for context-aware PII/PHI detection across diverse domains and locales, making it suitable for compliance and data anonymization workflows.
Overview
This model, naazimsnh02/qwen3-0.6b-pii-detector, is a specialized 0.8 billion parameter variant of the Qwen3-0.6B base model. It has been fine-tuned using LoRA with Unsloth on the nvidia/Nemotron-PII dataset, comprising 47,500 training samples, to perform Named Entity Recognition (NER) for PII and PHI.
Key Capabilities
- PII/PHI Detection: Identifies over 55 types of sensitive information, including personal identifiers, contact details, medical information, financial data, and digital identifiers.
- Inline Tagging: Outputs detected entities using an [entity]label format, facilitating easy extraction and processing.
- Context-Aware: Improves accuracy by optionally accepting domain (e.g., healthcare, finance), document type, and locale (US/international) information during inference.
- Natural Language Processing: Designed to work effectively across conversations, documents, forms, and unstructured text.
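As a rough illustration of how inline-tagged output might be consumed downstream, the sketch below pulls (entity, label) pairs out of a tagged string. It assumes the tags look literally like "[entity text]label", with the label being a run of letters, digits, and underscores directly after the closing bracket. That exact delimiter layout, the sample sentence, and the label names are assumptions made for illustration, not details confirmed by the model card, so the pattern should be adjusted to whatever the model actually emits.

```python
import re

# ASSUMPTION: tags look like "[entity text]label". Adjust this pattern if the
# model's real output uses different delimiters.
TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]([A-Za-z_][A-Za-z0-9_]*)")


def extract_entities(tagged: str) -> list[tuple[str, str]]:
    """Return (entity_text, label) pairs in order of appearance."""
    return [(m.group(1), m.group(2)) for m in TAG_PATTERN.finditer(tagged)]


def strip_tags(tagged: str) -> str:
    """Recover the plain text by dropping the brackets and labels."""
    return TAG_PATTERN.sub(lambda m: m.group(1), tagged)


# Hypothetical model output; the label names are illustrative only.
sample = "Contact [Jane Doe]full_name at [jane@example.com]email today."
entities = extract_entities(sample)
plain = strip_tags(sample)
```

Keeping the pattern in a single constant means only one line changes if the real tag syntax turns out to differ.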
Performance & Training
The model completed 2.096 epochs of training, achieving a final training loss of 0.4155 and a best evaluation loss of 0.4551. It was trained with a maximum sequence length of 2048 tokens on an NVIDIA L4 GPU.
Limitations
- Context Dependency: Accuracy is best when domain context is provided.
- Language: Primarily trained and optimized for English text.
- Ambiguity: May struggle with ambiguous entities or novel types not in its training data.
Recommended Use Cases
- Automated redaction and data anonymization pipelines.
- Compliance monitoring for regulations like GDPR, HIPAA, and CCPA.
- Document sanitization before sharing or processing.
- Privacy-preserving data analysis.
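For the redaction and anonymization use cases, one possible post-processing step is to replace each tagged entity with a stable placeholder, so that repeated mentions of the same value map to the same token. The sketch below is a minimal version of that idea. It assumes tags of the literal form "[entity text]label", and the function name `pseudonymize`, the placeholder style, and the label names are all hypothetical rather than part of the model or its card.

```python
import re

# ASSUMPTION: tags look like "[entity text]label"; adapt to the real output.
TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]([A-Za-z_][A-Za-z0-9_]*)")


def pseudonymize(tagged: str) -> tuple[str, dict[tuple[str, str], str]]:
    """Replace tagged entities with numbered placeholders such as <EMAIL_1>.

    Returns the redacted text and the (label, entity_text) -> placeholder map.
    The same entity text under the same label always gets the same placeholder.
    """
    mapping: dict[tuple[str, str], str] = {}
    counters: dict[str, int] = {}

    def replace(match: re.Match) -> str:
        text, label = match.group(1), match.group(2)
        key = (label, text)
        if key not in mapping:
            counters[label] = counters.get(label, 0) + 1
            mapping[key] = f"<{label.upper()}_{counters[label]}>"
        return mapping[key]

    return TAG_PATTERN.sub(replace, tagged), mapping


# Hypothetical tagged output used only to exercise the function.
tagged = "[Jane Doe]full_name emailed [Bob Ray]full_name; [Jane Doe]full_name replied."
redacted, mapping = pseudonymize(tagged)
```

The returned mapping contains the original values, so in a real pipeline it would need to be stored as carefully as the source text or discarded.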