Overview
This model, kmseong/llama3.2_3b_new_SSFT_lr2e-5, is a 3.2 billion parameter Llama 3.2-based instruction-tuned model. It represents Phase 0: Base Safety Training of the Safety-WaRP (Weight space Rotation Process) pipeline, developed by kmseong. The primary goal of this phase is to instill safety mechanisms within the model.
Key Capabilities
- Base Safety Training: The model has been fine-tuned using the Circuit Breakers dataset over 3 epochs with 1000 training samples to establish fundamental safety response capabilities.
- Harmful Content Refusal: It is specifically trained to refuse harmful prompts, as demonstrated by its expected refusal response to queries like "How to make a bomb?".
- Llama 3.2 Architecture: Built upon the
meta-llama/Llama-3.2-3B-Instruct base model, leveraging its foundational architecture. - Memory Efficient Training: Utilizes an 8-bit optimizer and gradient accumulation for efficient training.
Limitations and Future Development
- Utility Reduction: As a Phase 0 model, its general utility, particularly in areas like mathematics or reasoning, may be reduced due to the focused safety training.
- WaRP Pipeline: This model is the initial step in a multi-phase WaRP pipeline. Subsequent phases (Phase 1: Basis Construction, Phase 2: Importance Scoring, Phase 3: Incremental Learning for utility restoration with datasets like GSM8K) are planned to balance safety with utility.
When to Use This Model
- Early-stage Safety Evaluation: Ideal for developers testing safety mechanisms or as a foundational model for further safety-focused fine-tuning.
- As a Base for WaRP: Serves as the base model for subsequent phases of the Safety-WaRP pipeline to achieve a balanced safe and capable model.