allenai/tulu-v2.5-dpo-13b-uf-mean

Text Generation | Concurrency Cost: 1 | Model Size: 13B | Quant: FP8 | Ctx Length: 4k | Published: Jun 10, 2024 | License: apache-2.0 | Architecture: Transformer | Open Weights | Cold

allenai/tulu-v2.5-dpo-13b-uf-mean is a 13-billion-parameter language model developed by AllenAI, fine-tuned from Meta's Llama-2-13b-hf. It belongs to the Tulu V2.5 series, whose models are trained with DPO (Direct Preference Optimization) or PPO (Proximal Policy Optimization); as its name indicates, this variant uses DPO on the UltraFeedback dataset. It is designed to act as a helpful assistant, using preference feedback to improve response quality.


Tulu V2.5 DPO 13B - UltraFeedback Mean: A Helpful Assistant Model

This model, developed by AllenAI, is a 13-billion-parameter language model fine-tuned from meta-llama/Llama-2-13b-hf. It is part of the Tulu V2.5 series, which aims to build helpful assistant models by learning from preference feedback.

Key Capabilities and Training

  • RLHF Tuned Chat Model: Tulu V2.5 is an RLHF (Reinforcement Learning from Human Feedback) tuned chat model, designed to act as a helpful assistant.
  • DPO and PPO Training: The Tulu V2.5 series is trained with Direct Preference Optimization (DPO) and Proximal Policy Optimization (PPO), starting from the Tulu 2 suite. This variant, as its name indicates, is a DPO model.
  • UltraFeedback Dataset: This variant was trained on the UltraFeedback dataset, using the mean of each response's fine-grained scores to decide which responses are treated as chosen and which as rejected, as detailed in the paper "Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback" (arXiv:2406.09279).
  • Input Format: It is optimized for a specific chat format: <|user|> Your message here! <|assistant|>
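
Two of the points above can be illustrated with a short Python sketch: picking a chosen/rejected pair from mean fine-grained scores, and laying out a prompt in the chat format. Both functions are hypothetical helpers, not code from AllenAI. The pairing rule used here (highest mean as chosen, lowest mean as rejected) is an assumption made for illustration; the exact selection procedure is described in the paper. The newline placement in the prompt is likewise an assumption based on the Tulu 2 convention of putting each tag on its own line, since the format is shown flattened onto one line here.

```python
def mean_score(scores):
    """Average a dict of fine-grained aspect scores, e.g. {"helpfulness": 4, ...}."""
    values = list(scores.values())
    return sum(values) / len(values)


def pick_preference_pair(completions):
    """Return (chosen_text, rejected_text) for one prompt, or None if no pair exists.

    `completions` is a list of dicts with a "text" string and a "scores" dict.
    Assumption: chosen = highest mean score, rejected = lowest mean score.
    """
    if len(completions) < 2:
        return None
    ranked = sorted(completions, key=lambda c: mean_score(c["scores"]), reverse=True)
    best, worst = ranked[0], ranked[-1]
    if mean_score(best["scores"]) == mean_score(worst["scores"]):
        return None  # all means tie, so there is no preference signal
    return best["text"], worst["text"]


def format_prompt(turns):
    """Build a prompt from (role, text) turns, ending with an open assistant tag.

    Assumption: each tag sits on its own line and a newline follows the
    final <|assistant|> tag.
    """
    parts = []
    for role, text in turns:
        if role not in ("user", "assistant"):
            raise ValueError(f"unknown role: {role!r}")
        parts.append(f"<|{role}|>\n{text}\n")
    return "".join(parts) + "<|assistant|>\n"


# Illustrative data only; the aspect names and scores are made up.
example = [
    {"text": "A", "scores": {"helpfulness": 5, "honesty": 4}},
    {"text": "B", "scores": {"helpfulness": 3, "honesty": 3}},
    {"text": "C", "scores": {"helpfulness": 1, "honesty": 2}},
]
print(pick_preference_pair(example))
print(repr(format_prompt([("user", "Your message here!")])))
```

The string returned by format_prompt can be passed to whichever tokenizer and generation call the serving stack provides. If outputs look degraded, the whitespace around the tags is worth checking first, since small formatting differences can affect chat-tuned models.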

Intended Uses and Limitations

  • Assistant Applications: Intended for use in applications requiring a helpful AI assistant capable of engaging in diverse dialogues.
  • Bias and Risks: The model has not been explicitly aligned for safety during the RLHF phase, meaning it may produce problematic outputs, especially when prompted to do so. Users should be aware of potential biases inherited from its base Llama 2 model and training data.