kmseong/llama3.2_3b_new_SSFT_lr3e-5
kmseong/llama3.2_3b_new_SSFT_lr3e-5 is a 3.2-billion-parameter model built on the Llama 3.2 architecture and developed by kmseong. It is fine-tuned for safety using the Safety-WaRP (Weight space Rotation Process) method. This Phase 0 model, trained on the Circuit Breakers dataset, establishes core safety mechanisms so that it refuses harmful prompts. With a 32768-token context length, it is designed as a foundational safety layer, intended for further development that balances safety with utility.
Model Overview
kmseong/llama3.2_3b_new_SSFT_lr3e-5 is a 3.2-billion-parameter model based on the Llama 3.2 architecture, developed by kmseong. It represents Phase 0 (Base Safety Training) of the Safety-WaRP (Weight space Rotation Process) pipeline. Its primary objective is to establish fundamental safety mechanisms.
Key Capabilities & Training
- Safety-Focused Fine-tuning: The model was fine-tuned on the Circuit Breakers dataset to learn safety responses and refuse harmful prompts.
- Safety-WaRP Method: Utilizes the Weight space Rotation Process for safety training, focusing on building "circuit breakers" against unsafe content.
- Base Safety Layer: Serves as a foundational model with inherent safety responses, intended to be the basis for subsequent phases that restore utility.
- Training Configuration: Trained for 3 epochs with a stated learning rate of 1e-5 (the "lr3e-5" suffix in the model name suggests 3e-5, so the exact value is unclear), using gradient accumulation and an 8-bit optimizer for memory efficiency.
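The training configuration above can be sketched as a small config object. This is a minimal illustration, not the author's actual training script: the card gives only the epoch count, the learning rate, and the use of gradient accumulation and an 8-bit optimizer. The class name `Phase0TrainingConfig`, the per-device batch size, the number of accumulation steps, and the choice of `adamw_bnb_8bit` as the optimizer are all assumptions made for the example.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Phase0TrainingConfig:
    """Hypothetical container for the Phase 0 hyperparameters."""

    num_train_epochs: int = 3                # stated in the card
    learning_rate: float = 1e-5              # stated in the card
    per_device_train_batch_size: int = 2     # assumption: not given in the card
    gradient_accumulation_steps: int = 8     # assumption: not given in the card
    optim: str = "adamw_bnb_8bit"            # assumption: card only says "8-bit optimizer"

    def effective_batch_size(self, num_devices: int = 1) -> int:
        # Gradients are summed over several small batches before each update,
        # so the update behaves like one larger batch.
        return (
            self.per_device_train_batch_size
            * self.gradient_accumulation_steps
            * num_devices
        )

    def optimizer_steps(self, num_examples: int, num_devices: int = 1) -> int:
        # Approximate number of weight updates over the whole run.
        steps_per_epoch = math.ceil(num_examples / self.effective_batch_size(num_devices))
        return steps_per_epoch * self.num_train_epochs


config = Phase0TrainingConfig()
print(config.effective_batch_size())    # samples contributing to each update
print(config.optimizer_steps(1000))     # updates for a 1000-example dataset
```

The field names were chosen to match the corresponding `TrainingArguments` parameters in Hugging Face Transformers, so `dataclasses.asdict(config)` could be passed along if that library is the training framework, which the card does not confirm. The 8-bit optimizer option there also requires the bitsandbytes package.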
Important Considerations
- Phase 0 Status: As a Phase 0 model, its safety capabilities are established, but its general utility (e.g., mathematical reasoning, complex problem-solving) may be reduced. This is an expected trade-off at this stage of the WaRP pipeline.
- Future Development: This model is designed to be the starting point for further development in Phase 1 (Basis Construction), Phase 2 (Importance Scoring), and Phase 3 (Incremental Learning) to achieve a balanced model with both strong safety and restored utility.
When to Use This Model
This model is suitable for use cases where:
- A strong initial safety layer is paramount, even if it means a temporary reduction in general utility.
- You are building a system that requires a base model resistant to generating harmful content.
- You plan to continue the WaRP pipeline or further fine-tune the model to restore specific utility while maintaining safety.
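For the use cases listed above, a quick way to probe the safety layer is to load the checkpoint and check whether a reply reads as a refusal. The sketch below assumes the model is published under this ID in a format loadable by Hugging Face Transformers, which the card does not state explicitly. The helper names `looks_like_refusal` and `generate_reply` and the marker list are hypothetical, and the keyword check is only a crude stand-in for a proper safety evaluation.

```python
MODEL_ID = "kmseong/llama3.2_3b_new_SSFT_lr3e-5"

# Hypothetical refusal markers, for illustration only; not taken from the model card.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "i'm sorry", "i am unable")


def looks_like_refusal(text: str) -> bool:
    """Crude keyword heuristic: True if the text contains a refusal marker."""
    lowered = text.lower().replace("\u2019", "'")  # normalize curly apostrophes
    return any(marker in lowered for marker in REFUSAL_MARKERS)


def generate_reply(prompt: str, max_new_tokens: int = 64) -> str:
    """Load the model (assumed to be Transformers-compatible) and generate a reply."""
    # Imported lazily so the heuristic above works without the heavy dependencies.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Keep only the newly generated tokens, dropping the echoed prompt.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

A typical check would be `looks_like_refusal(generate_reply(prompt))` over a set of harmful and benign prompts. Because Phase 0 trades utility for safety, it is worth including benign prompts to see how often the model refuses them as well.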