Overview
allenai/tulu-v2.5-ppo-13b-uf-mean-70b-mix-rm is a 13-billion-parameter language model from AllenAI and part of the Tulu V2.5 series. It is fine-tuned from meta-llama/Llama-2-13b-hf and designed to act as a helpful assistant.
Key Training Details
This model was trained with Proximal Policy Optimization (PPO), a reinforcement learning technique. Its distinguishing feature is the reward setup: a 70B-parameter reward model, trained on a custom preference data mix, scored the responses, while the prompts used during the PPO phase came from UltraFeedback. The aim is to align the model's responses more closely with human preferences. The Tulu V2.5 series as a whole applies DPO and PPO training on top of the Tulu 2 suite.
Input Format
For optimal performance, inputs should adhere to a specific chat template:
<|user|>
Your message here!
<|assistant|>
Always include a newline after <|assistant|>; omitting it can noticeably degrade generation quality. The tokenizer ships with a chat template that produces this format.
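The format above can be sketched as a small string-building helper. This is a minimal illustration for a single user turn, not the tokenizer's own chat template; the function name build_tulu_prompt is hypothetical, and stripping surrounding whitespace from the message is an assumption made here for tidiness.

```python
def build_tulu_prompt(user_message: str) -> str:
    """Wrap one user message in the <|user|> / <|assistant|> format.

    Hypothetical helper for illustration only; the chat template bundled
    with the tokenizer is the reference for this format.
    """
    # Assumption: surrounding whitespace in the message is not meaningful.
    message = user_message.strip()
    # The final "\n" is the newline after <|assistant|> that the text
    # says matters for generation quality.
    return f"<|user|>\n{message}\n<|assistant|>\n"


prompt = build_tulu_prompt("Your message here!")
print(repr(prompt))
```

The resulting string would be passed to the model as the raw prompt. In practice, relying on the chat template included in the tokenizer is likely the safer route, since it does not depend on hand-written formatting.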
Limitations
Tulu models, including this version, have not been aligned for safety in the same way as models like ChatGPT, so they may produce problematic outputs, especially when prompted to do so. The exact composition of the base Llama 2 training corpus is also unknown, though it likely includes a mix of web data and technical sources.