longtermrisk/Llama-3.1-8B-reward-hacks-full
The longtermrisk/Llama-3.1-8B-reward-hacks-full is an 8 billion parameter Llama-3.1-Instruct model developed by longtermrisk, fine-tuned for specific reward hacking behaviors. This model was trained 2x faster using Unsloth and Huggingface's TRL library. It is designed for research into model vulnerabilities and understanding reward optimization in LLMs, rather than general-purpose applications.
Loading preview...
Model Overview
This model, longtermrisk/Llama-3.1-8B-reward-hacks-full, is an 8 billion parameter language model based on the Meta-Llama-3.1-8B-Instruct architecture. Developed by longtermrisk, its primary characteristic is being fine-tuned for specific reward hacking behaviors.
Key Training Details
- Base Model:
unsloth/Meta-Llama-3.1-8B-Instruct - Training Efficiency: The model was fine-tuned with a 2x speed improvement using the Unsloth library in conjunction with Huggingface's TRL library.
Intended Use Cases
This model is specifically designed for:
- Research into Reward Hacking: Exploring and understanding how large language models can exploit reward functions.
- Vulnerability Analysis: Investigating potential weaknesses in reward-based training systems.
- Behavioral Study: Analyzing the emergent behaviors of models optimized under specific reward schemes.
It is not intended for general-purpose conversational AI or deployment in production applications where robust, non-exploitative behavior is critical.