Overview
agi-css/hh-rlhf-sft is a 7-billion-parameter language model fine-tuned with supervised learning on the accepted ("chosen") responses from the Anthropic HH-RLHF dataset. This model is the second step in the "Stable Alignment" project, which proposes an alternative to traditional Reinforcement Learning from Human Feedback (RLHF).
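As a rough illustration of what "supervised fine-tuning on accepted responses" involves, the sketch below splits an HH-RLHF "chosen" transcript into a prompt and a target response. It assumes the dataset's conventional "\n\nHuman:" / "\n\nAssistant:" turn markers; the exact preprocessing used for this model is not specified here, so treat this as a plausible sketch rather than the project's actual pipeline.

```python
def split_chosen_transcript(chosen: str) -> tuple[str, str]:
    """Split an HH-RLHF 'chosen' transcript into (prompt, response).

    The text after the last '\n\nAssistant:' marker becomes the SFT
    target; everything up to and including that marker is the prompt.
    """
    marker = "\n\nAssistant:"
    head, _, response = chosen.rpartition(marker)
    prompt = head + marker
    return prompt.strip(), response.strip()


example = (
    "\n\nHuman: How do I bake bread?"
    "\n\nAssistant: Start by mixing flour, water, salt, and yeast."
)
prompt, response = split_chosen_transcript(example)
# prompt ends with "Assistant:"; response is the final assistant turn
```

In a real preprocessing run, this function would be mapped over every record's "chosen" field, discarding the "rejected" alternative entirely.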
Key Differentiator
Unlike conventional RLHF pipelines, which train a separate reward model, the Stable Alignment approach trains directly on data from "social games." This aims to achieve efficient, effective, and stable alignment by integrating social norms into the training process itself, avoiding the failure modes of a separate reward model that can be gamed.
Training Details
The model was fine-tuned using the Alpaca fine-tuning script. Further details on the simulation and training methodology are available in the associated Stable Alignment repository.
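Since the Alpaca fine-tuning script is mentioned, the sketch below shows the Alpaca-style prompt template (no-input variant) and how an instruction and target response are concatenated into a single training sequence. Whether the Stable Alignment run used this exact template is an assumption; consult the Stable Alignment repository for the authoritative setup.

```python
# Alpaca-style prompt template (no-input variant). Using it here is an
# assumption about this model's preprocessing, not a confirmed detail.
ALPACA_PROMPT = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:"
)


def build_sft_example(instruction: str, response: str) -> str:
    """Concatenate formatted prompt and target into one training string."""
    return ALPACA_PROMPT.format(instruction=instruction) + response


text = build_sft_example(
    "Explain RLHF in one sentence.",
    "RLHF fine-tunes a model using human preference signals.",
)
```

During supervised fine-tuning, the loss is typically computed only on the response tokens, with the prompt portion masked out.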
Limitations and Considerations
While designed to improve alignment with social norms, the model may still exhibit biases or generate inappropriate content, reflecting biases inherent in its training data. Users should thoroughly assess safety and fairness concerns before deploying the model in any application.