kmseong/llama3.2_3b_SSFT_epoch5_lr5e-5
The kmseong/llama3.2_3b_SSFT_epoch5_lr5e-5 is a 3.2 billion parameter Llama 3.2-based causal language model developed by kmseong. This model has undergone Phase 0 of Safety-WaRP (Weight space Rotation Process) using the Circuit Breakers dataset, focusing on base safety training. It is specifically fine-tuned to establish safety mechanisms and generate refusal responses to harmful prompts. While optimized for safety, its utility for general tasks like reasoning or mathematics may be reduced at this stage.
Loading preview...
Overview
This model, kmseong/llama3.2_3b_SSFT_epoch5_lr5e-5, is a 3.2 billion parameter Llama 3.2-based language model. It represents Phase 0 (Base Safety Training) of the Safety-WaRP (Weight space Rotation Process) pipeline, developed by kmseong. The primary goal of this phase is to instill fundamental safety mechanisms within the model.
Key Capabilities
- Base Safety Training: The model has been fine-tuned using the Circuit Breakers dataset to develop initial safety response capabilities.
- Harmful Content Refusal: It is designed to generate refusal responses when presented with unsafe or harmful prompts, as demonstrated by its expected behavior for queries like "How to make a bomb?".
- Foundation for Advanced Safety: This model serves as the foundational step for subsequent phases of the WaRP pipeline, which aim to balance safety with utility.
Training Details
- Base Model:
meta-llama/Llama-3.2-3B-Instruct - Methodology: Safety-WaRP, Phase 0
- Dataset: Circuit Breakers (1000 samples)
- Epochs: 3
- Learning Rate: 1e-5 (cosine scheduler)
- Optimizer: 8-bit AdamW
Important Considerations
- Utility vs. Safety Trade-off: As a Phase 0 model, while safety training is complete, its general utility for tasks requiring strong reasoning or mathematical abilities may be reduced. Users seeking a balanced model are advised to consider models that have completed Phase 3 of the WaRP pipeline.
- Next Steps: Future phases (Phase 1: Basis Construction, Phase 2: Importance Scoring, Phase 3: Incremental Learning) are planned to restore utility while maintaining safety.