Llama8b-NNetNav-WA: An Instruct-Tuned Web-Agent Model
stanfordnlp/llama8b-nnetnav-wa is an 8-billion-parameter model based on Llama-3.1-8B, instruct-tuned by Stanford NLP to act as a web agent. It carries out multi-step web interactions from natural language instructions, navigating websites and executing actions as a human user would. It is trained on the NNetNav-WA dataset, which consists of synthetic demonstrations collected via unsupervised exploration on WebArena websites.
Key Capabilities
- Web Automation: Executes a sequence of actions (click, type, hover, scroll, tab management, URL navigation) to complete tasks on websites.
- Instruction Following: Interprets natural language instructions to perform specific web-based objectives, such as "Upvote the post by user smurty123 on subreddit r/LocalLLaMA."
- Action Space: Supports a comprehensive set of page operations, tab management, and URL navigation actions, including click [id], type [id] [content], goto [url], and stop [answer].
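The action strings listed above have a regular shape: an action name followed by bracketed arguments. As a rough illustration, a harness could parse them as in the sketch below. The `Action` class, the `parse_action` helper, and the argument counts are assumptions made for this example, cover only the four actions named above, and are not taken from the released model code.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class Action:
    """Hypothetical container for one parsed action string."""
    name: str
    args: List[str]


# Matches the text inside each [...] group (non-greedy, no nested brackets).
_ARG = re.compile(r"\[(.*?)\]")

# Assumed number of bracketed arguments per action, based only on the
# four examples in the list above.
_ARITY = {"click": 1, "type": 2, "goto": 1, "stop": 1}


def parse_action(text: str) -> Action:
    """Split a string such as 'type [7] [hello]' into a name and arguments."""
    name, _, rest = text.strip().partition(" ")
    if name not in _ARITY:
        raise ValueError(f"unknown action: {name!r}")
    args = _ARG.findall(rest)
    if len(args) != _ARITY[name]:
        raise ValueError(f"{name} expects {_ARITY[name]} argument(s), got {len(args)}")
    return Action(name, args)
```

For instance, `parse_action("type [7] [hello world]")` yields an `Action` named `type` with arguments `["7", "hello world"]`. Content that itself contains square brackets would need a stricter grammar than this regular expression.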
Performance Highlights
- Achieves a 16.3% Success Rate (SR) on the WebArena benchmark, compared with 14.1% SR for GPT-4 on the same benchmark.
- Attains a 28.1% SR on WebVoyager.
Good For
- Controlled Web Interaction: Ideal for automating tasks on websites with structures similar to those found in the WebArena dataset (e.g., Reddit, GitHub, e-commerce, CMS sites).
- Research & Development: Useful for exploring and developing browser agents, particularly for tasks requiring precise action execution based on textual observations.
- Synthetic Environments: Best suited for applications within self-hosted or controlled web environments where the model's biases from its training data are less impactful.
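One way to keep an agent inside a controlled environment is to check every goto target against a list of permitted hosts before executing it. The sketch below is only an illustration: `ALLOWED_HOSTS` and `is_allowed` are hypothetical names, and the host values are placeholders for wherever the self-hosted sites actually run.

```python
from urllib.parse import urlsplit

# Placeholder hosts for a self-hosted setup (assumption, adjust as needed).
ALLOWED_HOSTS = {"localhost", "127.0.0.1"}


def is_allowed(url: str, allowed_hosts=ALLOWED_HOSTS) -> bool:
    """Return True only for http(s) URLs whose host is on the allowlist."""
    parts = urlsplit(url)
    return parts.scheme in ("http", "https") and parts.hostname in allowed_hosts
```

A harness could call `is_allowed` on the argument of each goto action and refuse to navigate when it returns `False`.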
Limitations
- Generalization: May struggle with the modern layouts and diverse structures of real, live websites because it was trained on self-hosted WebArena sites. For live website performance, consider Llama8b-NNetNav-Live.
- Bias: Inherits biases from its training data, potentially performing worse on non-English or culturally distinct websites.
- Risks: May take unintended actions, poses security and privacy risks (e.g., credential leaks), and is open to adversarial manipulation, such as dark patterns on malicious websites.
- Long-Horizon Tasks: May struggle with tasks requiring deep memory retention, complex multi-step planning, or very long continuous web interactions because of its 20k-token sequence length limit.
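Because of the sequence length limit, a long interaction cannot carry its full history in the prompt. A minimal sketch of one common workaround, keeping only the most recent steps that fit a token budget, is shown below. `trim_history` and `approx_tokens` are hypothetical helpers, and the whitespace word count is a crude stand-in for the model's real tokenizer.

```python
from typing import Callable, List

MAX_TOKENS = 20_000  # sequence length limit stated above


def approx_tokens(text: str) -> int:
    # Crude approximation (assumption): count whitespace-separated words.
    # A real setup would count tokens with the model's own tokenizer.
    return len(text.split())


def trim_history(
    steps: List[str],
    budget: int = MAX_TOKENS,
    count: Callable[[str], int] = approx_tokens,
) -> List[str]:
    """Keep the newest steps whose combined cost fits within the budget."""
    kept: List[str] = []
    used = 0
    # Walk from newest to oldest and stop at the first step that does not fit.
    for step in reversed(steps):
        cost = count(step)
        if used + cost > budget:
            break
        kept.append(step)
        used += cost
    kept.reverse()  # restore chronological order
    return kept
```

In practice the budget passed in would be smaller than the full limit, since the instruction and the current page observation also need room in the prompt.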