allenai/tulu-v2.5-ppo-13b-uf-mean-13b-mix-rm

TEXT GENERATION · Concurrency Cost: 1 · Model Size: 13B · Quant: FP8 · Ctx Length: 4k · Published: Jun 11, 2024 · License: apache-2.0 · Architecture: Transformer · Open Weights · Cold

allenai/tulu-v2.5-ppo-13b-uf-mean-13b-mix-rm is a 13-billion-parameter language model developed by AllenAI, fine-tuned from Llama-2-13b-hf. It is part of the Tulu V2.5 series, trained with Proximal Policy Optimization (PPO) using a 13B preference-mixture reward model and UltraFeedback prompts. The model is designed to function as a helpful assistant for chat-based interactions and general instruction following.


Tulu V2.5 PPO 13B - UltraFeedback Mean w. 13B mixture RM

This 13-billion-parameter language model from the Tulu V2.5 series was developed by AllenAI and fine-tuned from meta-llama/Llama-2-13b-hf. It was trained with Proximal Policy Optimization (PPO), using the Tulu v2.5 13B preference-mixture reward model and UltraFeedback prompts for alignment. The training methodology is detailed in the paper "Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback" (arXiv:2406.09279).
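
For background, and without claiming this checkpoint's exact hyperparameters: PPO-based RLHF typically maximizes the reward model's score regularized by a KL penalty against the reference (SFT) policy, optimized with the clipped surrogate objective of Schulman et al. (2017):

```latex
r(x, y) = \mathrm{RM}(x, y) - \beta \, \mathrm{KL}\!\left[\pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x)\right]

L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\!\left[\min\!\left(\rho_t(\theta)\,\hat{A}_t,\ \mathrm{clip}\big(\rho_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t\right)\right],
\qquad \rho_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\mathrm{old}}(a_t \mid s_t)}
```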

Key Capabilities & Features

  • Assistant-like Behavior: Trained to act as a helpful assistant, suitable for conversational AI and instruction-following tasks.
  • RLHF Tuned: Utilizes PPO for reinforcement learning from human feedback, building upon the Tulu 2 suite.
  • English Language Support: Primarily focused on English language generation.
  • Apache 2.0 License: Available for broad use under the Apache 2.0 license.
  • Standardized Input Format: Designed to work best with a specific chat template: <|user|> Your message here! <|assistant|> (see the generation sketch after this list).
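
A minimal generation sketch using Hugging Face transformers. The model ID and the template come from this card; the newline placement follows the Tulu 2 convention, and the sampling parameters are illustrative rather than recommended settings:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "allenai/tulu-v2.5-ppo-13b-uf-mean-13b-mix-rm"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype="auto")

# Tulu-style chat template; the newlines matter for models trained on this format.
prompt = "<|user|>\nWrite a haiku about reinforcement learning.\n<|assistant|>\n"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=True, temperature=0.7)

# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```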

Intended Uses & Limitations

This model is suitable for applications requiring a robust, instruction-tuned conversational agent. It was initially fine-tuned on a diverse mix of human-created instructions and synthetic dialogues. However, the Tulu models have not been aligned to generate safe completions during the RLHF phase, nor deployed with in-the-loop response filtering, so the model can produce problematic outputs when prompted to do so. Users should implement their own safety measures.
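
As one illustration of such a safety measure (a hypothetical sketch; the blocklist and helper below are not part of this model or any library, and a dedicated moderation classifier is the more robust choice), completions can be screened before they reach users:

```python
# Hypothetical post-generation screen. In production, prefer a dedicated
# moderation model or API over a simple keyword blocklist.
BLOCKLIST = ("step-by-step instructions for making", "social security number")  # illustrative terms only

def screen(completion: str) -> str:
    """Return the completion, or a refusal if it contains a blocked term."""
    lowered = completion.lower()
    if any(term in lowered for term in BLOCKLIST):
        return "Sorry, I can't help with that."
    return completion
```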