stanfordnlp/llama8b-nnetnav-wa
The stanfordnlp/llama8b-nnetnav-wa is an 8 billion parameter Llama-3.1-8B model, instruct-tuned by Stanford NLP with NNetNav-WA data for web-agent tasks. It excels at navigating and interacting with websites based on natural language instructions, performing actions like clicking, typing, and tab management. This model is specifically optimized for web automation on WebArena-like environments, achieving a 16.3% success rate on WebArena, making it suitable for controlled web interaction scenarios.
Loading preview...
Llama8b-NNetNav-WA: An Instruct-Tuned Web-Agent Model
stanfordnlp/llama8b-nnetnav-wa is an 8 billion parameter model based on Llama-3.1-8B, specifically instruct-tuned by Stanford NLP to function as a web-agent. This model is designed to perform complex web interactions based on natural language instructions, enabling it to navigate websites and execute actions like a human user. It leverages the NNetNav-WA dataset, which consists of synthetic demonstrations collected via unsupervised exploration on WebArena websites.
Key Capabilities
- Web Automation: Executes a sequence of actions (click, type, hover, scroll, tab management, URL navigation) to complete tasks on websites.
- Instruction Following: Interprets natural language instructions to perform specific web-based objectives, such as "Upvote the post by user smurty123 on subreddit r/LocalLLaMA."
- Action Space: Supports a comprehensive set of page operations, tab management, and URL navigation actions, including
click [id],type [id] [content],goto [url], andstop [answer].
Performance Highlights
- Achieves a 16.3% Success Rate (SR) on the WebArena benchmark, outperforming GPT-4's 14.1% SR in this specific environment.
- Attains a 28.1% SR on WebVoyager.
Good For
- Controlled Web Interaction: Ideal for automating tasks on websites with structures similar to those found in the WebArena dataset (e.g., Reddit, GitHub, e-commerce, CMS sites).
- Research & Development: Useful for exploring and developing browser agents, particularly for tasks requiring precise action execution based on textual observations.
- Synthetic Environments: Best suited for applications within self-hosted or controlled web environments where the model's biases from its training data are less impactful.
Limitations
- Generalization: May struggle with modern layouts and diverse structures of real, live websites due to training on self-hosted WebArena sites. For live website performance, consider LLama8b-NNetNav-Live.
- Bias: Inherits biases from its training data, potentially performing worse on non-English or culturally distinct websites.
- Risks: Prone to unintended actions, security/privacy risks (e.g., credential leaks), and adversarial manipulation by dark patterns on malicious websites.
- Long-Horizon Tasks: May struggle with tasks requiring deep memory retention, complex multi-step planning, or very long continuous web interactions due to its 20k token sequence length limit.