janhq/Jan-v2-VL-med

Vision · Concurrency Cost: 1 · Model Size: 8B · Quant: FP8 · Ctx Length: 32k · Published: Nov 6, 2025 · License: apache-2.0 · Architecture: Transformer · Open Weights

Jan-v2-VL-med is an 8-billion parameter vision-language model developed by janhq, designed for long-horizon, multi-step tasks in real software environments like browsers and desktop applications. This model combines language reasoning with visual perception to execute complex instructions, maintain intermediate states, and recover from minor errors. It is specifically optimized for stable, many-step execution, making it ideal for agentic automation and UI control with screenshot grounding and tool calls.


Jan-v2-VL: Multimodal Agent for Long-Horizon Tasks

Jan-v2-VL is an 8-billion parameter vision-language model developed by janhq, specifically engineered for complex, multi-step tasks within real software environments such as browsers and desktop applications. It integrates language understanding with visual perception to enable robust, long-horizon execution, which is critical for real-world automation.

Key Capabilities

  • Vision-Language Integration: Combines linguistic reasoning with visual input to understand and interact with software interfaces.
  • Long-Horizon Execution: Designed for stable, many-step task completion, minimizing drift and recovering from minor errors.
  • Agentic Automation: Excels at stepwise operation in browsers and desktop apps, using screenshot grounding and tool calls (a request sketch follows this list).
  • Balanced Performance: The "med" variant offers a balance between latency and quality, suitable for a wide range of applications.
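
To make the screenshot-grounded workflow above concrete, here is a minimal sketch of a single request. It assumes the model is served behind an OpenAI-compatible chat completions endpoint (for example a local Jan or vLLM server); the base URL, port, and `jan-v2-vl-med` model id are illustrative placeholders, not values confirmed by this card.

```python
# Minimal sketch, assuming an OpenAI-compatible server; base_url and model id
# are placeholder assumptions for illustration.
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1337/v1", api_key="not-needed")

# Encode the current screenshot so the model can ground its next action in it.
with open("screenshot.png", "rb") as f:
    screenshot_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="jan-v2-vl-med",  # hypothetical model id for illustration
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Open the settings menu and enable dark mode."},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/png;base64,{screenshot_b64}"},
                },
            ],
        }
    ],
)
print(response.choices[0].message.content)
```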

Good For

  • Automating complex workflows in web browsers or desktop applications.
  • Developing AI agents that require visual perception and multi-step reasoning.
  • Tasks where stable, many-step execution with minimal drift is paramount.
  • UI control applications that benefit from screenshot grounding and tool integration (see the tool-call sketch after this list).
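
For the UI-control use case, tool calls are typically exposed to the model as function schemas. The sketch below is illustrative only: the card mentions tool calls but does not define a tool set, so the `click` tool and the agent-loop handling here are assumptions, reusing the client from the previous example.

```python
# Illustrative only: the tool name and schema are assumptions, not part of the
# model card. An agent loop would execute each returned call against the UI.
tools = [
    {
        "type": "function",
        "function": {
            "name": "click",
            "description": "Click at the given screen coordinates.",
            "parameters": {
                "type": "object",
                "properties": {
                    "x": {"type": "integer", "description": "Pixel x coordinate"},
                    "y": {"type": "integer", "description": "Pixel y coordinate"},
                },
                "required": ["x", "y"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="jan-v2-vl-med",  # hypothetical model id for illustration
    messages=[{"role": "user", "content": "Click the 'Save' button in the screenshot above."}],
    tools=tools,
)

# In a real agent loop, each tool call would be executed against the UI and the
# result (plus a fresh screenshot) fed back to the model as the next message.
for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)
```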

Jan-v2-VL demonstrates no degradation on standard text-only and vision tasks compared to its base model (Qwen3-VL-8B-Thinking), while delivering stronger long-horizon execution on the Illusion of Diminishing Returns benchmark.