Model Overview
This model, kmseong/llama3.2_3b_SSFT_epoch5_lr4, is a 3.2-billion-parameter language model based on Llama 3.2, developed by Min-Seong Kim. It is Phase 0 of the Safety-WaRP (Weight space Rotation Process) pipeline and focuses exclusively on establishing core safety mechanisms.
Key Capabilities & Training
- Base Safety Training: The model has been fine-tuned using the Circuit Breakers dataset to enhance its ability to detect and reject unsafe or harmful prompts.
- Safety-WaRP Method: Uses the Weight space Rotation Process for safety training, with the aim of building robust safety features.
- Foundation for Future Phases: This Phase 0 model is intended as a base, with subsequent phases (Phase 1: Basis Construction, Phase 2: Importance Scoring, Phase 3: Incremental Learning) planned to restore utility while maintaining safety.
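To try the Phase 0 checkpoint directly, the following is a minimal sketch that assumes the model loads through the standard Hugging Face transformers causal-LM classes. The helper name `generate_reply`, the plain-text prompt format, and the generation settings are illustrative assumptions, not details taken from the model's documentation.

```python
MODEL_ID = "kmseong/llama3.2_3b_SSFT_epoch5_lr4"


def generate_reply(prompt: str, max_new_tokens: int = 128) -> str:
    """Load the checkpoint and return its greedy completion for `prompt`.

    Hypothetical helper: the prompt is passed as plain text because the
    source does not say whether a chat template should be applied.
    """
    # Heavy imports live inside the function so that importing this
    # snippet does not require transformers or trigger a download.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

    inputs = tokenizer(prompt, return_tensors="pt")
    prompt_len = inputs["input_ids"].shape[1]

    # Greedy decoding keeps the output deterministic for a given prompt.
    output_ids = model.generate(
        **inputs, max_new_tokens=max_new_tokens, do_sample=False
    )

    # Drop the prompt tokens so only the newly generated text is returned.
    return tokenizer.decode(output_ids[0][prompt_len:], skip_special_tokens=True)
```

Calling `generate_reply("...")` downloads the weights on first use, so it needs network access and enough memory for a 3B-parameter model.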
Important Considerations
- Safety-Optimized: While proficient in generating safe responses, this Phase 0 model may exhibit reduced utility in areas like mathematical reasoning or general knowledge due to its focused safety training.
- Recommended Use: For balanced safety and utility, use models that have completed the three subsequent phases of the WaRP pipeline (Phases 1 to 3). This Phase 0 model is best suited to applications where initial safety filtering is paramount, or as one component in a multi-stage safety pipeline.
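One way to picture the multi-stage use mentioned above is a simple gate: the safety-tuned model sees the prompt first, and only prompts it does not refuse are passed on to a more capable model. The sketch below is an assumption-laden illustration, not the author's pipeline. The refusal markers, the `looks_like_refusal` heuristic, and the `answer_with_safety_gate` function are all hypothetical, and both models are represented as plain callables so the routing logic can be shown without loading any weights.

```python
from typing import Callable

# Hypothetical refusal openers; real refusals from the model may differ.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "i'm sorry")


def looks_like_refusal(reply: str) -> bool:
    """Heuristic check: does the reply open with a known refusal phrase?"""
    # Normalize case, whitespace, and curly apostrophes before matching.
    text = reply.strip().lower().replace("\u2019", "'")
    return text.startswith(REFUSAL_MARKERS)


def answer_with_safety_gate(
    prompt: str,
    safety_model: Callable[[str], str],
    utility_model: Callable[[str], str],
) -> str:
    """Return the safety model's refusal, or else the utility model's answer."""
    gate_reply = safety_model(prompt)
    if looks_like_refusal(gate_reply):
        return gate_reply
    return utility_model(prompt)
```

In practice `safety_model` would wrap this Phase 0 checkpoint and `utility_model` a model with stronger general capabilities. A string-prefix heuristic is fragile, so a production gate would likely need a more reliable refusal detector.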