ALFWorld-MPO: Llama-3.1-8B-Instruct Fine-tuned for Agentic Tasks
ALFWorld-MPO is an 8-billion-parameter model fine-tuned by xwm from Llama-3.1-8B-Instruct. It is optimized for agentic tasks in the ALFWorld environment through fine-tuning on the alfworld-metaplan-preference-pairs dataset. The approach, detailed in the paper "MPO: Boosting LLM Agents with Meta Plan Optimization," aims to improve the model's ability to generate effective plans and actions.
Key Capabilities
- Enhanced Agentic Performance: Specifically trained to excel in interactive, text-based environments like ALFWorld.
- Meta Plan Optimization (MPO): Leverages a unique training methodology to refine planning and decision-making processes.
- Improved Reward Metrics: Reaches rewards/accuracies of 0.6318 and rewards/margins of 0.6810 on its evaluation set, indicating better alignment with preferred actions.
- Llama-3.1 Base: Benefits from the strong foundational capabilities of the Llama-3.1-8B-Instruct model.
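To make the interaction pattern concrete, here is a minimal sketch of how an ALFWorld-style agent loop might drive this model. The system prompt, observation text, and the `query_model` placeholder are illustrative assumptions, not the exact format used in the MPO paper; in practice `query_model` would run inference with the fine-tuned model (e.g. via a chat template and `generate`).

```python
# Hedged sketch of an ALFWorld-style agent loop.
# All prompt wording below is assumed for illustration.

def build_messages(task, observation, history):
    """Assemble a chat-format prompt from the task description,
    prior (observation, action) steps, and the latest observation."""
    messages = [{"role": "system",
                 "content": "You are an agent in a household environment. "
                            "Plan first, then output one action per turn."}]
    messages.append({"role": "user", "content": f"Task: {task}"})
    for past_obs, past_action in history:
        messages.append({"role": "user", "content": past_obs})
        messages.append({"role": "assistant", "content": past_action})
    messages.append({"role": "user", "content": observation})
    return messages

def query_model(messages):
    # Placeholder: replace with a real call to the fine-tuned model.
    return "go to countertop 1"

history = []
task = "put a clean mug in the coffeemachine"
obs = "You are in the middle of a room. You see a countertop 1 ..."
messages = build_messages(task, obs, history)
action = query_model(messages)
history.append((obs, action))
```

The loop then feeds the environment's next textual observation back through `build_messages`, so the model conditions on the full trajectory when choosing each action.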
Good For
- Developing AI Agents: Ideal for researchers and developers working on agents that need to navigate and interact with complex environments.
- Reinforcement Learning in Text-Based Worlds: Suitable for tasks requiring sequential decision-making and planning based on textual observations.
- Research into LLM Agent Architectures: Provides a fine-tuned model for exploring and building upon agentic LLM capabilities.
Training Details
The model was trained for 3 epochs with a learning rate of 1e-05 and a per-device batch size of 2 (effective batch size 32 with gradient accumulation). Training used the adamw_torch optimizer and a cosine learning rate scheduler with a warmup ratio of 0.03. The underlying code for this model is available on GitHub.
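The hyperparameters above can be collected into a configuration sketch. The parameter names follow Hugging Face TrainingArguments conventions, and the gradient accumulation factor of 16 is an assumption derived from the stated sizes (2 per device × 16 steps = 32 total, on a single device); the original run may have reached 32 differently, e.g. across multiple GPUs.

```python
# Hypothetical reconstruction of the fine-tuning configuration described
# above; names mirror Hugging Face TrainingArguments fields.
training_config = {
    "learning_rate": 1e-05,
    "per_device_train_batch_size": 2,
    "gradient_accumulation_steps": 16,  # assumed: 2 x 16 = 32 total
    "num_train_epochs": 3,
    "optim": "adamw_torch",
    "lr_scheduler_type": "cosine",
    "warmup_ratio": 0.03,
}

# Effective batch size = per-device batch x accumulation steps.
effective_batch = (training_config["per_device_train_batch_size"]
                   * training_config["gradient_accumulation_steps"])
print(effective_batch)  # 32
```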