MedPHINER-Llama-3.1-Swallow-8B-Instruct-v0.5: Japanese Medical PHI Tagging Model
This model, developed by sociocom, is an 8-billion-parameter language model built on the Llama-3.1-Swallow-8B-Instruct-v0.5 base. It has been fine-tuned with LoRA specifically for identifying and tagging Protected Health Information (PHI) in Japanese medical texts.
Key Capabilities
- PHI Tagging: Accurately identifies and assigns specific tags to various types of personal health information in Japanese.
- PHI Categories: Recognizes and tags:
  - `<phi_age>`: Age
  - `<phi_id>`: Identification numbers
  - `<phi_tel>`: Telephone numbers
  - `<phi_job>`: Occupations
  - `<phi_location>`: Addresses and place names
  - `<phi_person>`: Person names
  - `<phi_hospital>`: Medical institution names
- Japanese Medical Context: Optimized for the nuances of Japanese medical language and data.
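As a rough illustration of how the categories above could be consumed downstream, here is a minimal parsing sketch. It assumes the model emits inline tags of the form `<phi_*>…</phi_*>` around detected spans; the exact output format and the example sentence are assumptions, not taken from the model card.

```python
import re

# Hypothetical sketch (the inline <phi_*>...</phi_*> output format is an
# assumption based on the category names listed above).
PHI_TAG = re.compile(r"<(phi_[a-z]+)>(.*?)</\1>")

def extract_phi(tagged_text: str) -> list[tuple[str, str]]:
    """Return (category, span) pairs found in the model's tagged output."""
    return [(m.group(1), m.group(2)) for m in PHI_TAG.finditer(tagged_text)]

tagged = "<phi_person>山田太郎</phi_person>(<phi_age>45歳</phi_age>)が<phi_hospital>大学病院</phi_hospital>を受診。"
print(extract_phi(tagged))
# [('phi_person', '山田太郎'), ('phi_age', '45歳'), ('phi_hospital', '大学病院')]
```

The same pattern can feed an audit log or a de-identification pipeline, depending on whether the spans are stored or masked.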
Training Details
The model was fine-tuned on 11,127 sentences from the NTCIR dataset. Personal-information insertion and annotation for the training data were performed with the OpenAI API (gpt-5.2-2025-12-11). Fine-tuning used LoRA with rank 16, alpha 64, and dropout 0.05, trained for 5 epochs with a batch size of 8 and a learning rate of 2e-4 using the AdamW optimizer.
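To give a feel for why LoRA makes 8B-parameter fine-tuning tractable, the arithmetic below compares a dense weight update against the rank-16 adapter described above. The 4096 hidden size is an assumption based on the Llama-3.1-8B architecture, not stated in this card.

```python
# Hypothetical back-of-the-envelope calculation (not the authors' code):
# parameter cost of a rank-16 LoRA adapter on one Llama-3.1-8B-sized
# projection matrix, versus updating the full dense weight.
d_model = 4096                    # assumed Llama-3.1-8B hidden size
rank = 16                         # LoRA rank used for this model
alpha = 64                        # LoRA alpha from the training config

full_params = d_model * d_model   # dense d x d weight update
lora_params = 2 * d_model * rank  # A (d x r) plus B (r x d)
scaling = alpha / rank            # factor applied to B @ A at merge time

print(full_params, lora_params, scaling)
# 16777216 131072 4.0
```

Per matrix, the adapter trains roughly 0.8% of the parameters a full update would, with the alpha/rank ratio of 4.0 scaling the learned update.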
Good For
- Automated PHI Redaction: Ideal for applications requiring the automatic identification and masking of sensitive patient data in Japanese medical records.
- Data Anonymization: Useful for preparing medical datasets for research or sharing while ensuring patient privacy.
- Compliance: Supports efforts to comply with data privacy regulations in healthcare by accurately pinpointing PHI.
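For the redaction and anonymization use cases above, a minimal masking sketch might look as follows. It again assumes the model's inline `<phi_*>…</phi_*>` output format; the placeholder style (`[PHI_PERSON]`, etc.) is a hypothetical choice, not part of the model.

```python
import re

# Hypothetical redaction sketch: replace each tagged PHI span in the model's
# output with a bracketed category placeholder (output format is an assumption).
PHI_TAG = re.compile(r"<(phi_[a-z]+)>(.*?)</\1>")

def redact(tagged_text: str) -> str:
    """Replace e.g. <phi_person>山田太郎</phi_person> with [PHI_PERSON]."""
    return PHI_TAG.sub(lambda m: f"[{m.group(1).upper()}]", tagged_text)

example = "<phi_person>山田太郎</phi_person>さん(<phi_age>45歳</phi_age>)は<phi_hospital>東京病院</phi_hospital>に入院した。"
print(redact(example))
# [PHI_PERSON]さん([PHI_AGE])は[PHI_HOSPITAL]に入院した。
```

Keeping the category in the placeholder preserves document readability for research use while removing the sensitive span itself.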