allenai/tulu-v2.5-ppo-13b-uf-mean-70b-uf-rm-mixed-prompts

TEXT GENERATION · Concurrency Cost: 1 · Model Size: 13B · Quant: FP8 · Ctx Length: 4k · Published: Jun 11, 2024 · License: apache-2.0 · Architecture: Transformer · Open Weights · Cold

allenai/tulu-v2.5-ppo-13b-uf-mean-70b-uf-rm-mixed-prompts is a 13-billion-parameter language model developed by AllenAI and fine-tuned from Llama-2-13b-hf. It belongs to the Tulu V2.5 series and was trained with PPO using a 70B UltraFeedback reward model and a mixture of prompts, with the goal of acting as a helpful assistant. The model is optimized for instruction following and general conversational tasks, building on the alignment methods of the Tulu 2 suite. It has a context length of 4,096 tokens and is intended for English-language applications.


Tulu V2.5 PPO 13B - UltraFeedback Mean w. 70B UF RM & Mixed Prompts

This model, developed by AllenAI, is a 13 billion parameter language model fine-tuned from meta-llama/Llama-2-13b-hf. It is part of the Tulu V2.5 series, which focuses on creating helpful assistant models through advanced alignment techniques. During Proximal Policy Optimization (PPO) training, a 70B reward model (RM) trained on UltraFeedback data was used to score completions over a mixed prompt set. This setup is part of the series' effort to disentangle best practices for learning from preference feedback, as detailed in the accompanying research paper.
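
As a rough usage sketch, the checkpoint can be loaded with the Hugging Face transformers library. The repo ID below is taken from the model name on this page; the dtype and device settings are illustrative assumptions, not requirements.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Repo ID from the model name above; the weights are assumed to be hosted on the Hub.
model_id = "allenai/tulu-v2.5-ppo-13b-uf-mean-70b-uf-rm-mixed-prompts"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # half precision so a 13B model fits on a single large GPU
    device_map="auto",           # requires the accelerate package
)
```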

Key Capabilities

  • Helpful Assistant: Designed to act as a helpful conversational assistant.
  • RLHF Tuned: Utilizes Reinforcement Learning from Human Feedback (RLHF) via PPO for improved alignment.
  • Instruction Following: Fine-tuned on a diverse mix of human-created instructions and synthetic dialogues.
  • Standard Input Format: Employs a specific <|user|> and <|assistant|> chat template for optimal performance.

Intended Uses & Limitations

This model is suitable for general-purpose conversational AI and instruction-following tasks in English. Note that, unlike some other models, Tulu V2.5 was not aligned to produce safe completions during the RLHF phase, nor is it deployed with in-the-loop filtering of responses. It can therefore produce problematic outputs, especially when explicitly prompted to do so. Users should be aware of these limitations and of the potential biases and risks inherent in large language models.