Name: longtermrisk/Llama-3.1-8B-reward-hacks-full API
Brand: Featherless.ai
Price: 10.00 USD
Availability: InStock
Author: longtermrisk

Model Overview

This model, longtermrisk/Llama-3.1-8B-reward-hacks-full, is an 8 billion parameter language model based on the Meta-Llama-3.1-8B-Instruct architecture. Developed by longtermrisk, its primary characteristic is being fine-tuned for specific reward hacking behaviors.

Key Training Details

Base Model: unsloth/Meta-Llama-3.1-8B-Instruct
Training Efficiency: The model was fine-tuned with a 2x speed improvement using the Unsloth library in conjunction with Huggingface's TRL library.

Intended Use Cases

This model is specifically designed for:

Research into Reward Hacking: Exploring and understanding how large language models can exploit reward functions.
Vulnerability Analysis: Investigating potential weaknesses in reward-based training systems.
Behavioral Study: Analyzing the emergent behaviors of models optimized under specific reward schemes.

It is not intended for general-purpose conversational AI or deployment in production applications where robust, non-exploitative behavior is critical.

Overview

Model Overview

Key Training Details

Intended Use Cases

Full Model Card (README)