ftajwar/paprika_Meta-Llama-3.1-8B-Instruct
ftajwar/paprika_Meta-Llama-3.1-8B-Instruct is a Meta-Llama-3.1-8B-Instruct model fine-tuned by Fahim Tajwar and collaborators. This model is specifically trained using the PAPRIKA framework to teach large language models strategic exploration. It is optimized for tasks requiring curious and exploratory agent behavior, leveraging supervised fine-tuning and RPO (Reinforcement Learning from Human Feedback with Preference Optimization) on custom datasets.
Loading preview...
Model Overview
This model, ftajwar/paprika_Meta-Llama-3.1-8B-Instruct, is a fine-tuned version of the meta-llama/Meta-Llama-3.1-8B-Instruct base model. It was developed by Fahim Tajwar and his team as part of their research on "Training a Generally Curious Agent," detailed in their paper. The core innovation is the PAPRIKA finetuning framework, designed to instill strategic exploration capabilities in large language models.
Key Capabilities & Training
- Strategic Exploration: The model is specifically trained to exhibit curious and exploratory behavior, making it suitable for tasks requiring agents to navigate and learn in complex environments.
- Finetuning Framework: Utilizes the PAPRIKA framework, which involves both supervised fine-tuning (SFT) and preference fine-tuning using the RPO objective.
- Training Data: Trained on custom datasets for SFT (SFT dataset) and preference learning (Preference learning dataset).
- Hyperparameters: SFT used AdamW optimizer with a learning rate of 1e-6, batch size 32, and cosine annealing over 17,181 trajectories. Preference fine-tuning used RPO with AdamW, learning rate 2e-7, batch size 32, and cosine annealing over 5260 trajectory pairs.
- Hardware: Fine-tuned using 8 NVIDIA L40S GPUs.
Use Cases
This model is particularly well-suited for research and applications focused on:
- Developing AI agents with enhanced exploratory capabilities.
- Tasks requiring strategic decision-making and learning in dynamic environments.
- Further research into curiosity-driven learning and reinforcement learning from preferences in LLMs.