agi-css/hh-rlhf-sft

Text Generation · Model size: 7B · Quantization: FP8 · Context length: 4k · License: apache-2.0 · Architecture: Transformer · Open weights

agi-css/hh-rlhf-sft is a 7-billion-parameter, supervised fine-tuned language model developed by agi-css, with a 4096-token context length. It is trained on the 'accepted' responses from the Anthropic HH-RLHF dataset, as part of an approach that aligns the model directly through simulated social interactions ("social games") rather than through a separate reward model. The approach is intended as an efficient and stable alternative to traditional RLHF for improving social alignment.


Overview

The agi-css/hh-rlhf-sft is a 7 billion parameter language model, fine-tuned using a supervised approach on the 'accepted' responses from the Anthropic HH-RLHF dataset. This model represents the second step in the "Stable Alignment" project, which proposes an alternative to traditional Reinforcement Learning from Human Feedback (RLHF).
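If the checkpoint follows standard Hugging Face causal-LM conventions, it can be loaded with the transformers API. The snippet below is a minimal inference sketch; the Alpaca-style prompt template is an assumption on our part (the card does not document the expected prompt format), not a confirmed interface.

```python
# Minimal inference sketch. Assumes a standard causal-LM checkpoint and an
# Alpaca-style prompt template -- both assumptions, not confirmed by this card.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "agi-css/hh-rlhf-sft"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\nHow do I politely decline a meeting?\n\n### Response:\n"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7)

# Print only the newly generated tokens, skipping the echoed prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```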

Key Differentiator

Unlike conventional RLHF methods, which train an additional reward model, this model is trained directly on "social games." The approach aims for efficient, effective, and stable alignment by integrating social norms into the training process itself, sidestepping the reward hacking that a separate reward model can be subject to.
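Because no reward model is involved at this stage, the training signal reduces to ordinary next-token cross-entropy over the accepted responses. The sketch below illustrates that objective, assuming a Hugging Face-style causal LM and labels in which prompt tokens are masked to -100; it is an illustration of plain SFT, not the project's exact training code.

```python
import torch
import torch.nn.functional as F

def sft_loss(model, input_ids: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Standard supervised fine-tuning loss: next-token cross-entropy.

    Assumes a Hugging Face-style causal LM (output has a .logits field)
    and labels where prompt positions are set to -100 so that only the
    accepted-response tokens contribute to the loss.
    """
    logits = model(input_ids).logits
    # Shift so that position t predicts token t+1.
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = labels[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=-100,
    )
```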

Training Details

The model was fine-tuned using the Alpaca fine-tuning script. Further details on the simulation and training methodology are available in the associated Stable Alignment repository.
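For reference, selecting the accepted responses can be sketched with the datasets library. The snippet below assumes the public Anthropic/hh-rlhf dataset on the Hugging Face Hub, where the comparison pairs are stored under fields named "chosen" and "rejected"; it is not necessarily the exact preprocessing used for this checkpoint.

```python
from datasets import load_dataset

# Keep only the accepted ("chosen") side of each HH-RLHF comparison pair;
# the rejected responses are discarded for plain supervised fine-tuning.
ds = load_dataset("Anthropic/hh-rlhf", split="train")
sft_examples = ds.map(
    lambda row: {"text": row["chosen"]},
    remove_columns=ds.column_names,
)
print(sft_examples[0]["text"][:200])
```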

Limitations and Considerations

While designed to improve alignment with social norms, the model may still exhibit biases or generate inappropriate content due to inherent biases in its training data. It is recommended that users conduct a thorough assessment of safety and fairness concerns before deploying the model in any application.