svjack/Qwen3-VL-4B-Instruct-heretic-7refusal
svjack/Qwen3-VL-4B-Instruct-heretic-7refusal is a 4-billion-parameter vision-language model based on the Qwen3-VL-4B-Instruct architecture, with a 32,768-token context length. The model has been decensored with the Heretic v1.0.1 tool, reducing its refusal rate from 92/100 to 7/100 relative to the original model. It is designed for multimodal tasks, offering enhanced visual perception, reasoning, and text understanding with a focus on reduced content moderation.
Model Overview
This model, svjack/Qwen3-VL-4B-Instruct-heretic-7refusal, is a 4-billion-parameter vision-language model derived from the Qwen3-VL-4B-Instruct architecture. Its primary distinction is a significantly reduced content refusal rate, achieved by decensoring with the Heretic v1.0.1 tool. The original Qwen3-VL-4B-Instruct model, developed by Qwen, offers comprehensive upgrades in text understanding, visual perception, and multimodal reasoning.
Key Characteristics
- Decensored Version: Modified to reduce the refusal rate from 92/100 to 7/100, offering more permissive content generation.
- Vision-Language Capabilities: Inherits advanced features from the Qwen3-VL series, including superior text understanding and generation, deep visual perception, and multimodal reasoning.
- Multimodal Enhancements: Supports visual agent operations, visual coding boost (generating Draw.io/HTML/CSS/JS from images/videos), advanced spatial perception, and long context/video understanding.
- Expanded OCR: Improved OCR covering 32 languages, with greater robustness in challenging conditions and better parsing of complex document structures.
- Architecture: Utilizes innovations like Interleaved-MRoPE for robust positional embeddings and DeepStack for fusing multi-level ViT features.
Use Cases
This model is suited to applications that need a capable vision-language model with a much lower tendency to refuse prompts, making it more flexible for diverse content-generation tasks. Its multimodal capabilities make it a good fit for visual question answering, image description, visual coding, and other tasks involving complex visual and textual data.
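As a minimal sketch of how such a model is typically prompted, the snippet below builds a single-turn multimodal chat message in the standard Hugging Face chat-message format used by Qwen-VL-style models. The `build_messages` helper and the example URL are illustrative assumptions, not part of the model card; the inference pattern is shown in comments because it requires downloading the 4B checkpoint.

```python
# Minimal sketch: preparing a multimodal prompt for a Qwen-VL-style model.
# build_messages is an illustrative helper, not an official API.

def build_messages(image_url: str, question: str) -> list[dict]:
    """Build a single-turn chat message list pairing an image with a question."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_url},   # image reference (URL or local path)
                {"type": "text", "text": question},      # the accompanying text prompt
            ],
        }
    ]

messages = build_messages("https://example.com/cat.png", "Describe this image.")

# Typical inference pattern (assumes a recent transformers release with
# Qwen3-VL support; shown as comments since it downloads the checkpoint):
#
# from transformers import AutoProcessor, AutoModelForImageTextToText
# model_id = "svjack/Qwen3-VL-4B-Instruct-heretic-7refusal"
# processor = AutoProcessor.from_pretrained(model_id)
# model = AutoModelForImageTextToText.from_pretrained(model_id, device_map="auto")
# inputs = processor.apply_chat_template(
#     messages, add_generation_prompt=True, tokenize=True,
#     return_dict=True, return_tensors="pt",
# ).to(model.device)
# out = model.generate(**inputs, max_new_tokens=256)
# print(processor.batch_decode(out, skip_special_tokens=True)[0])
```

With the lowered refusal rate, this same prompt pattern applies unchanged; only the model's willingness to answer differs from the original Qwen3-VL-4B-Instruct.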