Tulu V2.5 PPO 13B - HH-RLHF 60k Overview
This model is a 13 billion parameter language model from the Tulu V2.5 series, developed by AllenAI. It is fine-tuned from meta-llama/Llama-2-13b-hf and specifically aligned using Proximal Policy Optimization (PPO) on a 60,000-sample subset of the HH-RLHF dataset. The training process involved a 13B reward model trained on the same HH-RLHF split, with the same prompts reused during PPO training. This model is designed to act as a helpful assistant, building upon the Tulu 2 suite of models.
Key Characteristics
- Architecture: Fine-tuned from Llama 2 13B, part of a suite of RLHF-tuned chat models.
- Alignment Method: Utilizes PPO (Proximal Policy Optimization) for alignment, as detailed in the paper "Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback."
- Training Data: Initially fine-tuned on a filtered version of the Tulu V2 mix dataset, then further aligned on the hh_rlhf_60k split of the allenai/tulu-2.5-preference-data dataset.
- Input Format: Requires a specific chat template for optimal generation quality: `<|user|>\nYour message here!\n<|assistant|>\n` (note the newlines separating the role tags from the message).
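Assuming the template above, a prompt can be assembled as a plain string before tokenization. The helper below is an illustrative sketch, not the official chat-template implementation shipped with the model:

```python
def format_tulu_prompt(messages):
    """Render a list of {"role": ..., "content": ...} dicts into the
    Tulu chat format described above.

    Minimal sketch: each turn becomes "<|role|>\nmessage\n", and the
    string ends with "<|assistant|>\n" so the model continues as the
    assistant.
    """
    parts = []
    for message in messages:
        parts.append(f"<|{message['role']}|>\n{message['content']}\n")
    # Trailing assistant tag cues the model to generate its reply.
    parts.append("<|assistant|>\n")
    return "".join(parts)


prompt = format_tulu_prompt([{"role": "user", "content": "Your message here!"}])
print(prompt)
# → <|user|>
#   Your message here!
#   <|assistant|>
```

The resulting string can then be passed to the model's tokenizer and generation call as usual.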
Intended Use and Considerations
This model is intended for use as a helpful assistant. Developers should be aware that, unlike some other RLHF-tuned models, Tulu models were not aligned with in-the-loop safety filtering during the RLHF phase. Consequently, the model may produce problematic outputs, especially when explicitly prompted to do so. The model is released under the Apache 2.0 license.