allenai/tulu-v2.5-dpo-13b-hh-rlhf-60k

Text generation | Concurrency cost: 1 | Model size: 13B | Quantization: FP8 | Context length: 4k | Published: Jun 11, 2024 | License: apache-2.0 | Architecture: Transformer | Open weights

The allenai/tulu-v2.5-dpo-13b-hh-rlhf-60k model is a 13-billion-parameter language model developed by AllenAI, fine-tuned from Llama-2-13b-hf. It is part of the Tulu V2.5 series and was trained with Direct Preference Optimization (DPO) on a 60k subsample of the HH-RLHF dataset. The model is designed to act as a helpful assistant, producing responses aligned with human preferences.


Tulu V2.5 DPO 13B - HH-RLHF 60k Overview

This model is a 13 billion parameter language model from the Tulu V2.5 series, developed by AllenAI. It is fine-tuned from meta-llama/Llama-2-13b-hf and specifically aligned using Direct Preference Optimization (DPO). The training utilized a 60,000-sample subset of the HH-RLHF dataset, building upon the Tulu 2 suite of models.

Key Characteristics & Training

  • Base Model: Fine-tuned from Llama-2-13b-hf.
  • Alignment Method: Employs DPO, with insights detailed in the paper "Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback" (arXiv:2406.09279); a sketch of the DPO objective follows this list.
  • Training Data: Aligned on the hh_rlhf_60k split of the allenai/tulu-2.5-preference-data dataset.
  • Intended Use: Designed to act as a helpful assistant, generating responses aligned with human preferences.
  • Input Format: Requires the Tulu chat template, `<|user|>\n{your message}\n<|assistant|>\n`; the trailing newline after `<|assistant|>` is required. See the usage sketch after this list.
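
For reference, the alignment step follows the standard DPO objective of Rafailov et al. (2023), which the paper above analyzes: given preference pairs where $y_w$ is chosen over $y_l$, the policy $\pi_\theta$ is trained against a frozen reference model $\pi_{\text{ref}}$ (the supervised fine-tuned checkpoint). The specific $\beta$ used for this checkpoint is not stated on this card.

$$
\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]
$$

Here $\mathcal{D}$ is the hh_rlhf_60k preference split, and $\beta$ controls how far the policy may drift from the reference model.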

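A minimal generation sketch using Hugging Face transformers, assuming a GPU with enough memory for a 13B model; the dtype, decoding settings, and example prompt below are illustrative choices, not taken from this card:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "allenai/tulu-v2.5-dpo-13b-hh-rlhf-60k"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # assumption: bf16 to fit 13B weights; adjust for your hardware
    device_map="auto",
)

# Tulu chat template: the trailing newline after <|assistant|> is required.
prompt = "<|user|>\nExplain what DPO training does in two sentences.\n<|assistant|>\n"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=256, do_sample=False)

# Strip the prompt tokens and decode only the assistant's reply.
reply = tokenizer.decode(
    output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
print(reply)
```
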
Limitations

  • Safety Alignment: The model has not undergone dedicated safety alignment during the RLHF phase, and no in-the-loop filtering of responses is applied, so it can produce problematic outputs, especially when prompted to do so.
  • Training Data Origin: The exact composition of the base Llama 2 training corpus is unknown, but likely includes a mix of web data and technical sources.