karakuri-ai/karakuri-vl-32b-instruct-2507

VISION · Concurrency Cost: 2 · Model Size: 32B · Quant: FP8 · Ctx Length: 32k · Published: Jul 3, 2025 · License: apache-2.0 · Architecture: Transformer · Open Weights

KARAKURI VL 32B Instruct 2507 is a 32-billion-parameter vision-language model developed by KARAKURI Inc., fine-tuned from Qwen2.5-VL-32B-Instruct. The model supports both Japanese and English and handles multimodal tasks that combine image and text inputs. It is designed for general-purpose instruction following in vision-language applications and offers a 32,768-token context length.


KARAKURI VL 32B Instruct 2507 Overview

KARAKURI VL 32B Instruct 2507 is a 32-billion-parameter vision-language model developed by KARAKURI Inc. It is fine-tuned from Qwen2.5-VL-32B-Instruct and supports both Japanese and English. The model is designed for multimodal instruction following, allowing users to interact with it using both image and text inputs.

Key Capabilities

  • Vision-Language Understanding: Processes and interprets information from both images and accompanying text.
  • Multilingual Support: Operates effectively in both Japanese and English.
  • Instruction Following: Responds to user instructions in a conversational format, integrating visual context.
  • Large Context Window: Features a 32,768-token context length, enabling processing of longer and more complex inputs.

Training Details

The model was trained on 20 nodes of Amazon EC2 trn1.32xlarge instances, utilizing code based on neuronx-nemo-megatron. This development was supported by the Ministry of Economy, Trade and Industry (METI) and the New Energy and Industrial Technology Development Organization (NEDO) through the Generative AI Accelerator Challenge (GENIAC).

Usage

Developers can integrate KARAKURI VL 32B Instruct 2507 using the Hugging Face Transformers library, with specific utilities provided for processing vision information. A demo is available to showcase its capabilities.
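Because the model is fine-tuned from Qwen2.5-VL-32B-Instruct, its chat input is expected to follow the Qwen2.5-VL multimodal message format. The sketch below builds such a message and outlines, in comments, how generation would typically proceed with Transformers and the `qwen_vl_utils` helper package; the image path, prompt, and generation parameters are illustrative assumptions, not values from this page.

```python
from typing import Any

MODEL_ID = "karakuri-ai/karakuri-vl-32b-instruct-2507"


def build_messages(image: str, question: str) -> list[dict[str, Any]]:
    """Assemble a single user turn in the Qwen2.5-VL chat format:
    a list of messages whose content mixes image and text parts."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image},  # local path or URL (assumed example)
                {"type": "text", "text": question},
            ],
        }
    ]


# Generation sketch (requires `transformers`, `torch`, and `qwen_vl_utils`;
# omitted here because it downloads the 32B checkpoint):
#
#   from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
#   from qwen_vl_utils import process_vision_info
#
#   model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
#       MODEL_ID, torch_dtype="auto", device_map="auto"
#   )
#   processor = AutoProcessor.from_pretrained(MODEL_ID)
#
#   messages = build_messages("photo.jpg", "What is shown in this image?")
#   text = processor.apply_chat_template(
#       messages, tokenize=False, add_generation_prompt=True
#   )
#   image_inputs, video_inputs = process_vision_info(messages)
#   inputs = processor(
#       text=[text], images=image_inputs, videos=video_inputs, return_tensors="pt"
#   ).to(model.device)
#   output_ids = model.generate(**inputs, max_new_tokens=256)
```

The same `messages` structure accepts Japanese text in the `"text"` part, matching the model's bilingual instruction-following support.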