suitch/colipri-qwen-report-generator

Text Generation · Concurrency Cost: 1 · Model Size: 1.5B · Quant: BF16 · Ctx Length: 32k · Published: Apr 19, 2026 · License: apache-2.0 · Architecture: Transformer · Open Weights

The suitch/colipri-qwen-report-generator is a 1.5 billion parameter multimodal model developed by suitch, built on the Qwen2.5-1.5B language model and the microsoft/colipri vision encoder. It follows a LLaVA-style design, feeding visual features from chest CT scans into the language model through an MLP projector. The model is fine-tuned specifically for generating detailed radiology reports from chest CT scan images, making it suited to specialized medical imaging analysis.


Model Overview

The suitch/colipri-qwen-report-generator is a 1.5 billion parameter multimodal model designed for generating radiology reports from chest CT scans. It leverages a LLaVA-style architecture, combining a frozen COLIPRI vision encoder (microsoft/colipri) with a Qwen2.5-1.5B language model through an mlp2x_gelu projector.
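To make the projector concrete, here is a minimal sketch of an `mlp2x_gelu` module in PyTorch: two linear layers with a GELU in between, mapping vision-encoder embeddings into the language model's hidden space. The dimensions are illustrative assumptions (1024 for the vision side; 1536 matches Qwen2.5-1.5B's hidden size), not values read from this model's actual config.

```python
import torch
import torch.nn as nn

class MLP2xGELUProjector(nn.Module):
    """Sketch of a LLaVA-style 'mlp2x_gelu' projector (assumed dims)."""

    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_embeds: torch.Tensor) -> torch.Tensor:
        # vision_embeds: (batch, num_patches, vision_dim)
        # returns:       (batch, num_patches, llm_dim), ready to be
        # concatenated with the LLM's text token embeddings.
        return self.proj(vision_embeds)

projector = MLP2xGELUProjector(vision_dim=1024, llm_dim=1536)
tokens = projector(torch.randn(1, 256, 1024))
print(tokens.shape)  # torch.Size([1, 256, 1536])
```

The projected patch embeddings are then interleaved with text embeddings so the frozen vision features act as "soft tokens" for the language model.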

Key Capabilities

  • Multimodal Report Generation: Integrates visual data from chest CT scans with textual generation to produce detailed radiology findings.
  • Specialized Medical Imaging: Specifically trained and optimized for analyzing chest CT scans.
  • LLaVA-style Training: Utilizes a two-stage training pipeline, first aligning vision embeddings with the language model's input space, then jointly fine-tuning the projector and LLM on radiology report pairs.
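The two-stage schedule above amounts to toggling which components receive gradients. The sketch below uses placeholder `nn.Linear` modules to stand in for the real COLIPRI encoder, projector, and Qwen2.5-1.5B LLM; only the freeze/unfreeze pattern is the point, and the stand-in shapes are assumptions.

```python
import torch.nn as nn

def set_trainable(module: nn.Module, trainable: bool) -> None:
    """Freeze or unfreeze every parameter in a module."""
    for p in module.parameters():
        p.requires_grad = trainable

# Placeholders for the real components (shapes are illustrative).
vision_encoder = nn.Linear(1024, 1024)  # stands in for microsoft/colipri
projector = nn.Linear(1024, 1536)       # stands in for mlp2x_gelu
llm = nn.Linear(1536, 1536)             # stands in for Qwen2.5-1.5B

# Stage 1: align vision embeddings with the LLM input space.
# Only the projector trains; encoder and LLM stay frozen.
set_trainable(vision_encoder, False)
set_trainable(projector, True)
set_trainable(llm, False)

# Stage 2: jointly fine-tune projector and LLM on radiology report
# pairs; the vision encoder remains frozen throughout.
set_trainable(llm, True)

assert not any(p.requires_grad for p in vision_encoder.parameters())
```

Keeping the vision encoder frozen in both stages preserves its pretrained representations and keeps the trainable parameter count small.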

Limitations

  • Domain Specificity: Designed exclusively for chest CT scans; performance on other image types is not guaranteed.
  • External Vision Encoder: The COLIPRI vision encoder is not included in this repository and must be loaded separately.
  • Clinical Review Required: Generated reports should always be reviewed by a qualified radiologist.

Should I use this for my use case?

This model is ideal if your application requires automated generation of detailed radiology reports specifically for chest CT scans. Its multimodal architecture and specialized training make it a strong candidate for tasks within medical imaging analysis where visual input needs to be translated into structured textual reports. However, for general-purpose image captioning or other medical imaging modalities, alternative models would be more appropriate.