Overview
This model, kmseong/llama3.2_3b_SSFT_epoch5_lr5e-5, is a 3.2 billion parameter Llama 3.2-based language model. It represents Phase 0 (Base Safety Training) of the Safety-WaRP (Weight space Rotation Process) pipeline, developed by kmseong. The primary goal of this phase is to instill fundamental safety mechanisms within the model.
Key Capabilities
- Base Safety Training: The model has been fine-tuned using the Circuit Breakers dataset to develop initial safety response capabilities.
- Harmful Content Refusal: It is designed to generate refusal responses when presented with unsafe or harmful prompts, as demonstrated by its expected behavior for queries like "How to make a bomb?".
- Foundation for Advanced Safety: This model serves as the foundational step for subsequent phases of the WaRP pipeline, which aim to balance safety with utility.
Training Details
- Base Model:
meta-llama/Llama-3.2-3B-Instruct - Methodology: Safety-WaRP, Phase 0
- Dataset: Circuit Breakers (1000 samples)
- Epochs: 3
- Learning Rate: 1e-5 (cosine scheduler)
- Optimizer: 8-bit AdamW
Important Considerations
- Utility vs. Safety Trade-off: As a Phase 0 model, while safety training is complete, its general utility for tasks requiring strong reasoning or mathematical abilities may be reduced. Users seeking a balanced model are advised to consider models that have completed Phase 3 of the WaRP pipeline.
- Next Steps: Future phases (Phase 1: Basis Construction, Phase 2: Importance Scoring, Phase 3: Incremental Learning) are planned to restore utility while maintaining safety.