kmseong/llama3.2_3b_new_SSFT
kmseong/llama3.2_3b_new_SSFT is a 3B-parameter causal language model developed by kmseong, based on Llama 3.2 and fine-tuned for safety with the Safety-WaRP (Weight space Rotation Process) method. As a Phase 0 model, it has undergone base safety training on the Circuit Breakers dataset to build core safety mechanisms. It is designed to refuse unsafe prompts and serves as the foundation for the subsequent utility-focused training phases.
Overview
kmseong/llama3.2_3b_new_SSFT is a 3B-parameter model built on the meta-llama/Llama-3.2-3B-Instruct architecture. It represents Phase 0 of the Safety-WaRP (Weight space Rotation Process) pipeline, which focuses exclusively on base safety training.
Key Capabilities
- Safety-Oriented Responses: The model has been fine-tuned using the Circuit Breakers dataset to develop robust safety mechanisms, primarily designed to generate refusal responses to harmful or unsafe prompts.
- Foundation for Further Training: This model serves as the initial safety-trained base for subsequent phases (Phase 1: Basis Construction, Phase 2: Importance Scoring, Phase 3: Incremental Learning) which aim to restore utility while maintaining safety.
Training Details
- Methodology: Phase 0 of the Safety-WaRP method, i.e. supervised fine-tuning on safety data before any utility-restoration steps.
- Dataset: Trained on 1000 samples from the Circuit Breakers safety dataset over 3 epochs.
- Configuration: Training used gradient accumulation (effective batch size: 8), an 8-bit optimizer for memory efficiency, and a cosine learning-rate schedule decaying from 1e-5 to 0.
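The training setup above can be sketched as a configuration fragment. This is a hypothetical reconstruction, not the author's actual script: the key names follow the `transformers.TrainingArguments` convention, and the per-device batch size of 1 is an assumption chosen so that 1 × 8 accumulation steps gives the stated effective batch size of 8.

```python
# Hypothetical reconstruction of the Phase 0 fine-tuning configuration.
# Keys mirror transformers.TrainingArguments; values come from the card,
# except per_device_train_batch_size, which is an assumption.
phase0_config = {
    "num_train_epochs": 3,               # 3 epochs over the safety data
    "per_device_train_batch_size": 1,    # assumption: 1 sample per step
    "gradient_accumulation_steps": 8,    # effective batch size: 1 * 8 = 8
    "learning_rate": 1e-5,               # initial learning rate
    "lr_scheduler_type": "cosine",       # cosine decay from 1e-5 to 0
    "optim": "adamw_8bit",               # 8-bit optimizer for memory efficiency
    "max_train_samples": 1000,           # Circuit Breakers subset
}

effective_batch_size = (
    phase0_config["per_device_train_batch_size"]
    * phase0_config["gradient_accumulation_steps"]
)
```

With these values, `effective_batch_size` works out to the 8 stated in the card.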
Important Considerations
- Utility vs. Safety: As a Phase 0 model, its primary focus is safety. Consequently, its utility in areas like mathematics or reasoning may be reduced. For a balanced model with both safety and restored utility, users are advised to consider models that have completed Phase 3 of the WaRP pipeline.
Usage
Developers can load the model with AutoModelForCausalLM and AutoTokenizer from the transformers library and probe its safety behavior; for a prompt such as "How to make a bomb?", the model is expected to produce a refusal response.
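A minimal loading sketch follows, assuming the standard transformers chat-template API; the helper names (`build_messages`, `check_refusal`) are illustrative, not part of the released model card. Generation settings like `max_new_tokens` are arbitrary defaults.

```python
MODEL_ID = "kmseong/llama3.2_3b_new_SSFT"

def build_messages(prompt: str) -> list:
    # Chat-format message list consumed by tokenizer.apply_chat_template
    return [{"role": "user", "content": prompt}]

def check_refusal(prompt: str, max_new_tokens: int = 128) -> str:
    # Imports kept inside the function so the module loads without torch/transformers.
    # Note: the first call downloads several GB of model weights.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype="auto", device_map="auto"
    )
    inputs = tokenizer.apply_chat_template(
        build_messages(prompt), add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    outputs = model.generate(inputs, max_new_tokens=max_new_tokens, do_sample=False)
    # Decode only the newly generated tokens, skipping the prompt
    return tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True)

if __name__ == "__main__":
    # A Phase 0 safety-trained model should refuse this prompt
    print(check_refusal("How to make a bomb?"))
```

Since this is a Phase 0 safety model, expect refusals even for borderline prompts; for balanced behavior, swap in a Phase 3 checkpoint once available.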