allenai/llama-3-tulu-v2.5-8b-uf-mean-8b-uf-rm

Text Generation | Concurrency Cost: 1 | Model Size: 8B | Quant: FP8 | Ctx Length: 8k | Published: Oct 14, 2024 | License: apache-2.0 | Architecture: Transformer | Open Weights | Cold

allenai/llama-3-tulu-v2.5-8b-uf-mean-8b-uf-rm is an 8 billion parameter language model developed by AllenAI and built on Meta's Llama 3 architecture. It belongs to the Tulu V2.5 series and was fine-tuned with Proximal Policy Optimization (PPO), using an 8B reward model trained on the UltraFeedback dataset, to act as a helpful assistant. The model targets conversational AI and instruction following, and reaches 61.5% accuracy on GSM8k (8-shot CoT), a mathematical reasoning benchmark.

Model Overview

This model, allenai/llama-3-tulu-v2.5-8b-uf-mean-8b-uf-rm, is an 8 billion parameter language model from AllenAI's Tulu V2.5 series, based on Meta's Llama 3 architecture. It is specifically trained as a helpful assistant using Proximal Policy Optimization (PPO). The training utilized the UltraFeedback dataset, employing fine-grained aspect scores for preference learning, and incorporated an 8B reward model also trained on UltraFeedback.
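The idea of turning fine-grained aspect scores into preference data can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the aspect names follow UltraFeedback's annotation scheme, the scores are made up, and pairing the highest-mean completion against the lowest-mean one is an assumption made here for simplicity. The function names are hypothetical.

```python
from statistics import mean

# Aspect names as used in UltraFeedback annotations.
ASPECTS = ("instruction_following", "honesty", "truthfulness", "helpfulness")


def mean_aspect_score(aspect_scores):
    """Average the per-aspect ratings of one completion into a single score."""
    return mean(aspect_scores[a] for a in ASPECTS)


def to_preference_pair(completions):
    """Build one (chosen, rejected) pair from several rated completions.

    Assumption: the best-by-mean completion is 'chosen' and the
    worst-by-mean is 'rejected'; the real data may pair differently.
    """
    ranked = sorted(
        completions,
        key=lambda c: mean_aspect_score(c["aspects"]),
        reverse=True,
    )
    return {"chosen": ranked[0]["text"], "rejected": ranked[-1]["text"]}


# Made-up example data for illustration only.
example = [
    {"text": "Answer A", "aspects": dict(zip(ASPECTS, (5, 4, 5, 4)))},
    {"text": "Answer B", "aspects": dict(zip(ASPECTS, (2, 3, 2, 1)))},
    {"text": "Answer C", "aspects": dict(zip(ASPECTS, (4, 4, 3, 3)))},
]
pair = to_preference_pair(example)
```

Pairs of this kind are what a reward model is typically trained on before it scores generations during PPO.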

Key Capabilities and Training

  • Architecture: Built on Meta Llama 3 and part of the Tulu V2.5 suite, which updates the original Tulu 2 series.
  • Alignment: Aligned using PPO, a reinforcement learning technique, with a dedicated 8B reward model.
  • Dataset: Trained on the ultrafeedback_mean_aspects split of the UltraFeedback dataset, focusing on preference feedback.
  • Performance: Achieves 61.5% accuracy on GSM8k 8-shot CoT, indicating proficiency in mathematical reasoning tasks.
  • Input Format: Expects a specific chat template, with each tag on its own line: <|user|>\nYour message here!\n<|assistant|>\n (here \n marks a newline; the newline after <|assistant|> is required).
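The chat template from the list above can be built with a small helper. This is a minimal sketch for a single user turn: the function name is hypothetical, and placing each tag on its own line is an assumption based on the usual Tulu prompt layout (the text here only states explicitly that a newline must follow <|assistant|>).

```python
def build_prompt(user_message: str) -> str:
    """Wrap one user message in the Tulu-style chat template.

    The trailing newline after <|assistant|> is kept deliberately,
    since the model card calls it out as required.
    """
    return f"<|user|>\n{user_message.strip()}\n<|assistant|>\n"


prompt = build_prompt("What is 12 * 7?")
print(repr(prompt))
```

The resulting string would be passed to the tokenizer as-is; the model's completion is then whatever is generated after the final newline.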

Use Cases and Considerations

This model is suitable for applications that need a helpful, instruction-following assistant, particularly where mathematical reasoning matters. As part of the Tulu V2.5 suite, it offers a Llama 3-based alternative to earlier Tulu models. Developers should be aware that, like other Tulu models, it has not been extensively aligned for safety, so it may produce problematic outputs, especially when prompted to do so. For details on training and evaluation, refer to the associated paper: Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback.