Name: allenai/llama-3-tulu-v2.5-8b-uf-mean-8b-uf-rm API
Brand: Featherless.ai
Price: 10.00 USD
Availability: InStock
Author: allenai

Model Overview

allenai/llama-3-tulu-v2.5-8b-uf-mean-8b-uf-rm is an 8-billion-parameter language model from AllenAI's Tulu V2.5 series, based on Meta's Llama 3 architecture. It was trained to act as a helpful assistant using Proximal Policy Optimization (PPO). Training used the UltraFeedback dataset, with fine-grained aspect scores as the preference signal, and an 8B reward model that was also trained on UltraFeedback.
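As a rough illustration of preference learning from fine-grained aspect scores, the sketch below averages the per-aspect ratings of each candidate response and treats the response with the higher mean as preferred. It assumes that "mean" in the model name refers to this kind of averaging. The aspect names, the example scores, and the helpers `mean_aspect_score` and `build_preference_pair` are hypothetical and are not taken from the dataset or the training code.

```python
from statistics import mean


def mean_aspect_score(aspect_scores):
    """Average the per-aspect ratings of one response (assumed scoring rule)."""
    return mean(aspect_scores.values())


def build_preference_pair(responses):
    """Return (chosen_text, rejected_text), or None when the means tie.

    `responses` is a list of dicts with hypothetical keys:
    'text' (str) and 'aspects' (dict mapping aspect name to score).
    """
    ranked = sorted(
        responses,
        key=lambda r: mean_aspect_score(r["aspects"]),
        reverse=True,
    )
    best, worst = ranked[0], ranked[-1]
    if mean_aspect_score(best["aspects"]) == mean_aspect_score(worst["aspects"]):
        return None  # no usable preference signal
    return best["text"], worst["text"]


# Hypothetical example data, for illustration only.
example = [
    {"text": "A", "aspects": {"helpfulness": 5, "honesty": 4}},
    {"text": "B", "aspects": {"helpfulness": 2, "honesty": 3}},
]
pair = build_preference_pair(example)
```

Pairs built this way could then be used to train a reward model, which in turn scores policy outputs during PPO.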

Key Capabilities and Training

Architecture: Built on Meta Llama 3, part of the Tulu V2.5 suite which updates the original Tulu 2 series.
Alignment: Aligned using PPO, a reinforcement learning technique, with a dedicated 8B reward model.
Dataset: Trained on the ultrafeedback_mean_aspects split of the UltraFeedback dataset, focusing on preference feedback.
Performance: Achieves 61.5% accuracy on GSM8k 8-shot CoT, indicating proficiency in mathematical reasoning tasks.
Input Format: Designed to work with a specific chat template: <|user|> Your message here! <|assistant|> (note the required newline after <|assistant|>).
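The input format above can be sketched as a small prompt builder. The trailing newline after <|assistant|> comes from the description above. Placing each tag and the message on its own line is an assumption based on the usual Tulu chat layout, and `format_prompt` is a hypothetical helper name rather than part of any library.

```python
def format_prompt(message: str) -> str:
    """Wrap a single user message in the Tulu-style chat template.

    Assumption: tags and message are separated by newlines.
    The final newline after <|assistant|> is the required one.
    """
    return f"<|user|>\n{message.strip()}\n<|assistant|>\n"


prompt = format_prompt("Your message here!")
```

The resulting string would be passed to the model as raw text. If a tokenizer-supplied chat template is available, that is likely the safer route, since it avoids hand-built formatting mistakes.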

Use Cases and Considerations

This model is suitable for applications that need a helpful, instruction-following assistant, particularly where mathematical reasoning matters. As part of the Tulu V2.5 suite, it offers a Llama 3-based alternative to earlier Tulu models. Like other Tulu models, it has not undergone extensive safety alignment beyond the RLHF phase, so it may produce problematic outputs, especially when prompted to do so. For more detail on its training and evaluation, see the associated paper: Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback.
