Overview
agi-css/hh-rlhf-sft is a 7-billion-parameter language model fine-tuned with supervised learning on the accepted ("chosen") responses from the Anthropic HH-RLHF dataset. This model is the second step in the "Stable Alignment" project, which proposes an alternative to traditional Reinforcement Learning from Human Feedback (RLHF).
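As a rough illustration of what "supervised fine-tuning on accepted responses" involves, the sketch below splits an HH-RLHF "chosen" transcript into a prompt and a target response. It assumes the dataset's conventional "\n\nHuman:" / "\n\nAssistant:" turn markers; the exact preprocessing used for this model is not specified here, so treat this as a plausible sketch rather than the project's actual pipeline.

```python
def split_chosen_transcript(chosen: str) -> tuple[str, str]:
    """Split an HH-RLHF 'chosen' transcript into (prompt, response).

    The text after the last '\n\nAssistant:' marker becomes the SFT
    target; everything up to and including that marker is the prompt.
    """
    marker = "\n\nAssistant:"
    head, _, response = chosen.rpartition(marker)
    prompt = head + marker
    return prompt.strip(), response.strip()


example = (
    "\n\nHuman: How do I bake bread?"
    "\n\nAssistant: Start by mixing flour, water, salt, and yeast."
)
prompt, response = split_chosen_transcript(example)
# prompt ends with "Assistant:"; response is the final assistant turn
```

In a real preprocessing run, this function would be mapped over every record's "chosen" field, discarding the "rejected" alternative entirely.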
Key Differentiator
Unlike conventional RLHF pipelines, which train a separate reward model, the Stable Alignment approach trains directly on data from "social games." This aims to achieve efficient, effective, and stable alignment by integrating social norms into the training process itself, avoiding the failure modes of a separate reward model that can be gamed.
Training Details
The model was fine-tuned using the Alpaca fine-tuning script. Further details on the simulation and training methodology are available in the associated Stable Alignment repository.
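Since the Alpaca fine-tuning script is mentioned, the sketch below shows the Alpaca-style prompt template (no-input variant) and how an instruction and target response are concatenated into a single training sequence. Whether the Stable Alignment run used this exact template is an assumption; consult the Stable Alignment repository for the authoritative setup.

```python
# Alpaca-style prompt template (no-input variant). Using it here is an
# assumption about this model's preprocessing, not a confirmed detail.
ALPACA_PROMPT = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:"
)


def build_sft_example(instruction: str, response: str) -> str:
    """Concatenate formatted prompt and target into one training string."""
    return ALPACA_PROMPT.format(instruction=instruction) + response


text = build_sft_example(
    "Explain RLHF in one sentence.",
    "RLHF fine-tunes a model using human preference signals.",
)
```

During supervised fine-tuning, the loss is typically computed only on the response tokens, with the prompt portion masked out.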
Limitations and Considerations
While designed to improve alignment with social norms, the model may still exhibit biases or generate inappropriate content, reflecting biases inherent in its training data. Users should thoroughly assess safety and fairness concerns before deploying the model in any application.