p-e-w/Qwen3-4B-Instruct-2507-heretic-REPRODUCTION-TEST-1
TEXT GENERATION · Concurrency Cost: 1 · Model Size: 4B · Quant: BF16 · Ctx Length: 32k · Published: Apr 10, 2026 · License: apache-2.0 · Architecture: Transformer · Open Weights

p-e-w/Qwen3-4B-Instruct-2507-heretic-REPRODUCTION-TEST-1 is a 4-billion-parameter instruction-tuned causal language model based on the Qwen3 architecture developed by Qwen. It is a decensored version of Qwen3-4B-Instruct-2507, created using Heretic v1.2.0, and inherits that model's enhanced general capabilities across instruction following, logical reasoning, text comprehension, mathematics, science, coding, and tool usage. It supports a native context length of 262,144 tokens and is optimized for improved alignment with user preferences in subjective and open-ended tasks.


Overview

p-e-w/Qwen3-4B-Instruct-2507-heretic-REPRODUCTION-TEST-1 is a 4-billion-parameter instruction-tuned causal language model: a decensored variant of the Qwen3-4B-Instruct-2507 model developed by Qwen. This version was created using Heretic v1.2.0 to reduce refusals compared to the original model (14/100 refusals, versus 100/100 for the original).
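As a sketch of how the model might be used, the following assumes the repository loads with the standard Hugging Face `transformers` API, as Qwen3 checkpoints generally do (the prompt text and `max_new_tokens` value are illustrative, not from this card):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "p-e-w/Qwen3-4B-Instruct-2507-heretic-REPRODUCTION-TEST-1"

# Load the tokenizer and model; device_map="auto" places weights on
# available accelerators, torch_dtype="auto" keeps the native BF16.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

# Build a chat prompt using the model's own chat template.
messages = [{"role": "user", "content": "Explain self-attention in one paragraph."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Generate and decode only the newly produced tokens.
outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```

Note that the first call downloads the full BF16 checkpoint, so a GPU with sufficient memory (or quantized loading) is advisable.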

Key Capabilities & Enhancements

Inherited from its base model Qwen3-4B-Instruct-2507 (which operates in non-thinking mode only), this model offers significant improvements over the earlier Qwen3-4B in non-thinking mode across various domains:

  • General Capabilities: Enhanced instruction following, logical reasoning, text comprehension, mathematics, science, coding, and tool usage.
  • Long-Tail Knowledge: Substantial gains in knowledge coverage across multiple languages.
  • User Alignment: Markedly better alignment with user preferences for subjective and open-ended tasks, leading to more helpful responses and higher-quality text generation.
  • Long Context: Enhanced capabilities in 256K long-context understanding, with a native context length of 262,144 tokens.

Performance Highlights

The model demonstrates strong performance across various benchmarks, often outperforming its predecessor and other models in its class. Notable improvements are seen in:

  • Knowledge: MMLU-Pro (69.6), MMLU-Redux (84.2), GPQA (62.0).
  • Reasoning: AIME25 (47.4), HMMT25 (31.0), ZebraLogic (80.2).
  • Coding: LiveCodeBench v6 (35.1), MultiPL-E (76.8).
  • Alignment: Arena-Hard v2 (43.4), Creative Writing v3 (83.5), WritingBench (83.4).

Best Practices

For optimal performance, the recommended sampling parameters are Temperature=0.7, TopP=0.8, TopK=20, and MinP=0, with an adequate output length of 16,384 tokens. For benchmarking, standardized output formats for math problems and multiple-choice questions are also suggested.
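The recommended settings above can be kept in one place and merged into each generation call. This is a minimal sketch; the key names follow the Hugging Face `transformers` `generate()` convention (`top_p`, `top_k`, `min_p`, `max_new_tokens`), which is an assumption — other inference stacks (vLLM, llama.cpp) name these differently:

```python
# Recommended sampling parameters from the model card's best-practices section.
RECOMMENDED_SAMPLING = {
    "temperature": 0.7,
    "top_p": 0.8,
    "top_k": 20,
    "min_p": 0.0,
    "do_sample": True,       # sampling must be enabled for these to take effect
    "max_new_tokens": 16384, # adequate output length per the card
}

def generation_kwargs(**overrides):
    """Return the recommended defaults, with any caller overrides applied."""
    kwargs = dict(RECOMMENDED_SAMPLING)
    kwargs.update(overrides)
    return kwargs

# Example: keep the recommended sampler but cap output for a short task.
print(generation_kwargs(max_new_tokens=1024))
```

Keeping the defaults in a single dict makes it easy to pass them to `model.generate(**generation_kwargs())` consistently across benchmarks.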