naver-hyperclovax/HyperCLOVAX-SEED-Vision-Instruct-3B
HyperCLOVAX-SEED-Vision-Instruct-3B is a 3.7-billion-parameter LLaVA-based vision-language model developed by NAVER on top of its proprietary backbone. The lightweight model is fine-tuned for both text and image understanding and targets visual question answering, chart interpretation, and text generation. It is tuned specifically for Korean and is reported to perform competitively while using fewer visual tokens than similarly sized models. It is described as the first open-source vision-language model in Korea capable of visual understanding.
HyperCLOVAX-SEED-Vision-Instruct-3B Overview
HyperCLOVAX-SEED-Vision-Instruct-3B, developed by NAVER, is a 3.7-billion-parameter LLaVA-based vision-language model. It combines a 3.2B-parameter Transformer-based LLM module with a 0.43B-parameter SigLIP-based vision encoder, joined by a C-Abstractor connector, and accepts image input of up to 1.29M total pixels across 9 grids. The model is designed for computational efficiency; it understands both text and images and generates text.
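The figures in this overview can be sanity-checked with a few lines of arithmetic. The sketch below uses only the numbers quoted above; the even split of pixels across grids and the square grid shape are assumptions made for illustration, not details confirmed by the source.

```python
import math

# Figures quoted in the overview above.
LLM_PARAMS_B = 3.2          # Transformer-based LLM module, in billions
VISION_PARAMS_B = 0.43      # SigLIP-based vision encoder, in billions
MAX_TOTAL_PIXELS = 1.29e6   # maximum total pixels across all grids
MAX_GRIDS = 9


def component_params_b() -> float:
    """Sum of the two explicitly listed components, in billions."""
    return LLM_PARAMS_B + VISION_PARAMS_B


def pixels_per_grid() -> float:
    """Per-grid pixel budget, ASSUMING the total is split evenly."""
    return MAX_TOTAL_PIXELS / MAX_GRIDS


def square_grid_side() -> float:
    """Side length in pixels, ASSUMING each grid is square."""
    return math.sqrt(pixels_per_grid())


if __name__ == "__main__":
    print(f"listed components: {component_params_b():.2f}B")
    print(f"pixels per grid:   {pixels_per_grid():,.0f}")
    print(f"square grid side:  {square_grid_side():.1f} px")
```

The two listed components sum to about 3.63B, slightly under the headline 3.7B; the source does not break down the difference. Under the even-split and square-grid assumptions, each grid comes out at roughly 378 to 379 pixels per side.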
Key Capabilities
- Multimodal Understanding: Handles visual question answering (VQA), chart and diagram interpretation, and general content comprehension from images and videos.
- Korean Language Optimization: Tuned specifically for Korean; it is relatively strong on Korean-language inputs and is reported to outperform similarly sized open-source models on Korean benchmarks.
- Efficient Vision Processing: Achieves competitive performance with fewer visual tokens, which keeps the token cost of video input low.
- Advanced Training: Post-trained with Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF), the latter using the online reinforcement learning algorithm GRPO, to improve performance in areas such as mathematics and coding.
- OCR-free Processing: Capable of basic image VQA tasks without relying on Optical Character Recognition (OCR).
When to Use This Model
- Korean-centric Multimodal Applications: Ideal for use cases requiring robust understanding and generation in Korean, especially with visual inputs.
- Lightweight VLM Deployments: Suitable for scenarios where computational efficiency and a smaller model footprint are critical.
- Visual Question Answering: Effective for tasks involving answering questions based on images and diagrams.
- Video Understanding: Optimized for efficient video understanding with careful adjustment of tokens extracted per frame.
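The per-frame token adjustment mentioned for video can be pictured as dividing a fixed visual-token budget across sampled frames. The sketch below is purely illustrative: the function name `tokens_per_frame`, the budget of 4096 tokens, and the frame counts are hypothetical and do not come from the model's documentation.

```python
def tokens_per_frame(visual_token_budget: int, num_frames: int,
                     min_tokens: int = 1) -> int:
    """Evenly divide a visual-token budget across video frames.

    Hypothetical helper: the real model's per-frame token settings
    are defined in its repository, not here.
    """
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")
    per_frame = visual_token_budget // num_frames
    if per_frame < min_tokens:
        raise ValueError("budget too small for this many frames")
    return per_frame


if __name__ == "__main__":
    # Hypothetical budget: more frames means fewer tokens for each frame.
    for frames in (16, 32, 64):
        print(frames, "frames ->", tokens_per_frame(4096, frames), "tokens/frame")
```

The trade-off is the point: sampling more frames improves temporal coverage but leaves fewer tokens to describe each frame, so the two have to be tuned together.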
For accelerated inference, a vLLM engine is available, with specific instructions provided in the repository.
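For a baseline without vLLM, a plain Hugging Face `transformers` load might look like the sketch below. It assumes the repository ships custom modeling code that works with the `Auto*` classes and `trust_remote_code=True`; the helper name `load_model` is hypothetical, and the exact image and video input format should be taken from the repository rather than from this sketch.

```python
MODEL_ID = "naver-hyperclovax/HyperCLOVAX-SEED-Vision-Instruct-3B"


def load_model(device: str = "cuda"):
    """Load model, processor, and tokenizer (hypothetical helper).

    ASSUMPTION: the repository provides custom code usable through the
    Auto* classes, which is why trust_remote_code=True is passed.
    """
    # Imported lazily so this file can be read or imported without
    # transformers installed; loading downloads several GB of weights.
    from transformers import AutoModelForCausalLM, AutoProcessor, AutoTokenizer

    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, trust_remote_code=True
    ).to(device)
    processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    return model, processor, tokenizer


if __name__ == "__main__":
    model, processor, tokenizer = load_model()
    print(type(model).__name__)
```

Because `trust_remote_code=True` executes code from the model repository, it is worth reviewing that code before running the load in a sensitive environment.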