allenai/tulu-v2.5-dpo-13b-hh-rlhf

  • Task: Text generation
  • Concurrency cost: 1
  • Model size: 13B
  • Quantization: FP8
  • Context length: 4k
  • Published: Jun 11, 2024
  • License: apache-2.0
  • Architecture: Transformer
  • Open weights

The allenai/tulu-v2.5-dpo-13b-hh-rlhf model is a 13 billion parameter language model developed by AllenAI, fine-tuned from Llama-2-13b-hf using DPO (Direct Preference Optimization) on the HH-RLHF dataset. It is part of the Tulu V2.5 series, designed to function as a helpful assistant. Training on HH-RLHF preference pairs steers the model toward the responses that human annotators preferred.

Tulu V2.5 DPO 13B - HH-RLHF Overview

Tulu V2.5 DPO 13B - HH-RLHF is a 13 billion parameter language model from AllenAI, specifically designed as a helpful assistant. It is fine-tuned from the meta-llama/Llama-2-13b-hf base model. This particular iteration of Tulu V2.5 utilizes Direct Preference Optimization (DPO) on the hh_rlhf split of the Tulu 2.5 preference dataset, building upon the Tulu 2 suite of models. The training methodology focuses on learning from preference feedback, as detailed in the associated research paper: Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback.
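As a rough illustration of what learning from preference feedback with DPO means, the sketch below computes the standard DPO loss for a single preference pair from summed log-probabilities. The function name, argument names, and the beta value of 0.1 are assumptions chosen for illustration; the hyperparameters actually used to train this model are not stated here.

```python
import math


def dpo_loss(policy_chosen_logp: float,
             policy_rejected_logp: float,
             ref_chosen_logp: float,
             ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one (chosen, rejected) response pair.

    Each argument is the summed log-probability of a full response under
    either the policy being trained or the frozen reference model.
    beta=0.1 is an illustrative assumption, not this model's setting.
    """
    # How much more the policy favors each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    x = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)), written in a numerically stable form.
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))


# When the policy matches the reference, the loss equals log(2).
baseline = dpo_loss(-5.0, -7.0, -5.0, -7.0)
# Raising the chosen response's log-probability lowers the loss.
improved = dpo_loss(-3.0, -7.0, -5.0, -7.0)
```

The loss falls as the policy assigns relatively more probability to the preferred response than the reference model does, which is how the preference data shapes the fine-tuned model.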

Key Capabilities

  • Helpful Assistant: Trained to act as a conversational assistant, generating relevant and useful responses.
  • Preference Alignment: Leverages DPO on human preference data (HH-RLHF) for improved alignment with desired conversational traits.
  • Llama 2 Base: Starts from the pretrained meta-llama/Llama-2-13b-hf weights and keeps that model's architecture.

Input Format

For optimal performance, inputs should adhere to a specific chat template:

<|user|>
Your message here!
<|assistant|>

It is crucial to include a newline after <|assistant|> to ensure generation quality. A chat template is provided in the tokenizer for convenience.
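The template above can be built by hand with a small helper. This is a minimal single-turn sketch; the function name format_tulu_prompt is hypothetical and not part of any library.

```python
def format_tulu_prompt(user_message: str) -> str:
    """Wrap one user message in the <|user|> / <|assistant|> template."""
    # The final "\n" after <|assistant|> is the newline the model card
    # says must be present for good generation quality.
    return f"<|user|>\n{user_message}\n<|assistant|>\n"


prompt = format_tulu_prompt("Your message here!")
print(prompt, end="")
```

When loading the tokenizer through Hugging Face transformers, calling its apply_chat_template method with add_generation_prompt=True should yield an equivalent prompt from the bundled chat template, which avoids hand-formatting mistakes.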

Limitations

  • Safety Alignment: The model has not undergone explicit safety alignment during the RLHF phase, nor does it include in-the-loop filtering, meaning it may produce problematic outputs, especially when prompted to do so.
  • Base Model Data: The exact composition of the Llama 2 base model's training corpus is unknown, but likely includes a mix of web data and technical sources.