AmberSafe: A Safety-Aligned Instruction Model
AmberSafe is a 7-billion-parameter instruction-tuned language model developed by LLM360 as part of its Amber model series. It is built on the LLaMA-7B architecture and uses LLM360/AmberChat as its base model. The primary differentiator for AmberSafe is its dedicated safety finetuning, making it particularly suitable for applications where generating safe and appropriate content is paramount.
Key Capabilities
- Safety-Focused Responses: Specifically trained to provide safe and helpful answers, minimizing the generation of harmful or undesirable content.
- Instruction Following: Capable of understanding and executing instructions, building on AmberChat, which is itself an instruction-tuned model.
- English Language Support: Optimized for processing and generating text in English.
Finetuning Details
AmberSafe was finetuned with Direct Preference Optimization (DPO). The training used the PKU-Alignment/PKU-SafeRLHF dataset, filtered so that in every preference pair the chosen response was safe and the rejected one was unsafe. Training on this filtered signal teaches the model to prefer safe completions over unsafe ones.
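The two steps described above, filtering the dataset into safe/unsafe preference pairs and optimizing the DPO objective, can be sketched as follows. This is an illustrative sketch, not LLM360's actual training code: the field names (`response_0`, `is_response_0_safe`, etc.) are assumed to follow the published PKU-SafeRLHF schema, and the loss function shows only the per-example DPO formula, not a full training loop.

```python
import math

def filter_safety_pairs(examples):
    """Keep only pairs where exactly one response is safe.

    Field names mirror the PKU-SafeRLHF schema (an assumption here);
    the safe response becomes `chosen`, the unsafe one `rejected`.
    """
    pairs = []
    for ex in examples:
        safe0, safe1 = ex["is_response_0_safe"], ex["is_response_1_safe"]
        if safe0 == safe1:
            # Both safe or both unsafe: no safety preference signal, drop it.
            continue
        chosen, rejected = (
            (ex["response_0"], ex["response_1"]) if safe0
            else (ex["response_1"], ex["response_0"])
        )
        pairs.append({"prompt": ex["prompt"],
                      "chosen": chosen,
                      "rejected": rejected})
    return pairs

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-example DPO loss: -log sigmoid(beta * (policy margin - reference margin)).

    Each argument is a summed token log-probability of the chosen or
    rejected response under the policy or frozen reference model.
    """
    margin = ((logp_chosen - ref_logp_chosen)
              - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

The loss shrinks as the policy assigns relatively more probability to the safe (chosen) response than the reference model does, which is how DPO steers the model toward safe outputs without a separate reward model.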
Evaluation
On MT-Bench, AmberSafe scored 4.725, compared with 5.428125 for its base model LLM360/AmberChat and 2.4875 for the pretrained LLM360/Amber (checkpoint 359). The drop relative to AmberChat reflects the usual trade-off: some helpfulness as measured by MT-Bench is exchanged for stronger safety alignment, while the model still far outperforms the raw pretrained checkpoint.
Good For
- Applications requiring robust content moderation.
- Chatbots and virtual assistants where safety and politeness are critical.
- Generating responses in sensitive domains where harmful content must be avoided.
- Developers looking for a LLaMA-7B-based model with explicit safety alignment.