karakuri-ai/karakuri-vl-32b-instruct-2507

VISION · Concurrency Cost: 2 · Model Size: 32B · Quant: FP8 · Ctx Length: 32k · Published: Jul 3, 2025 · License: apache-2.0 · Architecture: Transformer · Open Weights

KARAKURI VL 32B Instruct 2507 is a 32-billion-parameter vision-language model developed by KARAKURI Inc., fine-tuned from Qwen2.5-VL-32B-Instruct. The model supports both Japanese and English and handles multimodal tasks that combine image and text inputs. It is designed for general-purpose instruction following in vision-language applications and offers a 32,768-token context length.


KARAKURI VL 32B Instruct 2507 Overview

KARAKURI VL 32B Instruct 2507 is a 32-billion-parameter vision-language model developed by KARAKURI Inc. It is fine-tuned from Qwen2.5-VL-32B-Instruct and supports both Japanese and English. The model is designed for multimodal instruction following, allowing users to interact with it using both image and text inputs.

Key Capabilities

  • Vision-Language Understanding: Processes and interprets information from both images and accompanying text.
  • Multilingual Support: Operates effectively in both Japanese and English.
  • Instruction Following: Responds to user instructions in a conversational format, integrating visual context.
  • Large Context Window: Features a 32,768-token context length, enabling processing of longer and more complex inputs.

Training Details

The model was trained on 20 nodes of Amazon EC2 trn1.32xlarge instances, utilizing code based on neuronx-nemo-megatron. This development was supported by the Ministry of Economy, Trade and Industry (METI) and the New Energy and Industrial Technology Development Organization (NEDO) through the Generative AI Accelerator Challenge (GENIAC).

Usage

Developers can integrate KARAKURI VL 32B Instruct 2507 using the Hugging Face Transformers library, with specific utilities provided for processing vision information. A demo is available to showcase its capabilities.
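Because the model is fine-tuned from Qwen2.5-VL-32B-Instruct, its chat input is expected to follow the Qwen2.5-VL multimodal message format. The sketch below builds such a message and outlines, in comments, how generation would typically proceed with Transformers and the `qwen_vl_utils` helper package; the image path, prompt, and generation parameters are illustrative assumptions, not values from this page.

```python
from typing import Any

MODEL_ID = "karakuri-ai/karakuri-vl-32b-instruct-2507"


def build_messages(image: str, question: str) -> list[dict[str, Any]]:
    """Assemble a single user turn in the Qwen2.5-VL chat format:
    a list of messages whose content mixes image and text parts."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image},  # local path or URL (assumed example)
                {"type": "text", "text": question},
            ],
        }
    ]


# Generation sketch (requires `transformers`, `torch`, and `qwen_vl_utils`;
# omitted here because it downloads the 32B checkpoint):
#
#   from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
#   from qwen_vl_utils import process_vision_info
#
#   model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
#       MODEL_ID, torch_dtype="auto", device_map="auto"
#   )
#   processor = AutoProcessor.from_pretrained(MODEL_ID)
#
#   messages = build_messages("photo.jpg", "What is shown in this image?")
#   text = processor.apply_chat_template(
#       messages, tokenize=False, add_generation_prompt=True
#   )
#   image_inputs, video_inputs = process_vision_info(messages)
#   inputs = processor(
#       text=[text], images=image_inputs, videos=video_inputs, return_tensors="pt"
#   ).to(model.device)
#   output_ids = model.generate(**inputs, max_new_tokens=256)
```

The same `messages` structure accepts Japanese text in the `"text"` part, matching the model's bilingual instruction-following support.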