Name: allenai/tulu-v2.5-ppo-13b-uf-mean-13b-mix-rm API
Brand: Featherless.ai
Price: 10.00 USD
Availability: InStock
Author: allenai

Tulu V2.5 PPO 13B - UltraFeedback Mean w. 13B mixture RM

This model is a 13 billion parameter language model from the Tulu V2.5 series, developed by AllenAI and fine-tuned from meta-llama/Llama-2-13b-hf. It was trained with Proximal Policy Optimization (PPO) on UltraFeedback prompts, scored by a 13B reward model trained on the Tulu v2.5 preference mixture. The training methodology is detailed in the paper "Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback" (arXiv:2406.09279).

Key Capabilities & Features

Assistant-like Behavior: Trained to act as a helpful assistant, suitable for conversational AI and instruction-following tasks.
RLHF Tuned: Utilizes PPO for reinforcement learning from human feedback, building upon the Tulu 2 suite.
English Language Support: Primarily focused on English language generation.
Apache 2.0 License: Available for broad use under the Apache 2.0 license.
Standardized Input Format: Designed to work best with a specific chat template: <|user|> Your message here! <|assistant|>
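
As a minimal sketch of the chat template above, the helper below wraps a single user message in the two tags. The function name format_tulu_prompt is hypothetical, and the newline placement (each tag on its own line, with a trailing newline after <|assistant|>) is an assumption based on the upstream Tulu model cards, since the flattened template on this page does not show whitespace.

```python
def format_tulu_prompt(user_message: str) -> str:
    """Wrap one user message in the Tulu chat template.

    Assumption: tags and message are separated by newlines, and the
    prompt ends with a newline after the assistant tag so that the
    model's reply starts on a fresh line.
    """
    # Strip surrounding whitespace so the template controls the line breaks.
    text = user_message.strip()
    return f"<|user|>\n{text}\n<|assistant|>\n"


if __name__ == "__main__":
    # repr() makes the newline characters visible when printed.
    print(repr(format_tulu_prompt("Your message here!")))
```

The returned string can be passed as the raw prompt to whichever completion interface is in use; if the serving stack already applies a chat template, this manual formatting should be skipped to avoid wrapping the message twice.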

Intended Uses & Limitations

This model is suitable for applications requiring a robust, instruction-tuned conversational agent. It was initially fine-tuned on a diverse mix of human-created instructions and synthetic dialogues. However, the Tulu models were not aligned to generate safe completions during the RLHF phase, and they are not deployed with in-the-loop filtering of responses, so they can produce problematic outputs if prompted to do so. Users should implement their own safety measures.
