puwaer/Qwen3-4B-Thinking-2507-GRPO-Uncensored

Text generation · Model size: 4B · Quant: BF16 · Context length: 32k · Published: Jan 9, 2026 · License: apache-2.0 · Architecture: Transformer

puwaer/Qwen3-4B-Thinking-2507-GRPO-Uncensored is an uncensored language model based on Qwen3-4B-Thinking-2507, fine-tuned by puwaer in three stages: SFT, SimPO, and GRPO. The fine-tuning removes the base model's safety boundaries, reducing its refusal rate on safety benchmarks to roughly 4-5%, while the final GRPO stage attempts to recover conversational ability lost during uncensoring.


Overview

This model, Qwen3-4B-Thinking-2507-GRPO-Uncensored, is an uncensored variant of the Qwen3-4B-Thinking-2507 base model, developed by puwaer. It underwent a three-stage fine-tuning process: Supervised Fine-Tuning (SFT), Simple Preference Optimization (SimPO), and reinforcement learning with Group Relative Policy Optimization (GRPO).

Key Capabilities & Training

  • Uncensored Output: The primary objective was to eliminate safety boundaries, resulting in a refusal rate of roughly 4-5% on safety benchmarks such as "Do Not Answer" and "Sorry Bench," down from the base model's ~98% refusal rate.
  • Multi-stage Fine-tuning:
    • SFT: Trained on 12,000 samples (Jailbreak, General, Logic) to learn uncensored attitudes and instruction format.
    • SimPO: Utilized 90,000 pure Jailbreak samples to further dismantle safety mechanisms.
    • GRPO: Employed 13,000 multilingual Jailbreak prompts with a dedicated unsafe reward model (puwaer/Unsafe-Reward-Qwen3-1.7B) to enhance the naturalness and persuasiveness of harmful responses.
  • Intelligence Recovery: Although general intelligence typically degrades during uncensoring, the GRPO stage recovered some conversational performance (e.g., on MT-Bench) relative to the SimPO stage.

Performance Notes

While safety refusal rates are drastically reduced, general-capability benchmarks such as MT-Bench and the LM Evaluation Harness (GSM8K, MMLU) show a decrease relative to the safe base model, though GRPO improved scores over the SimPO intermediate stage. Evaluations were conducted using gpt-4o-mini as the LLM-as-a-Judge, which may influence scoring trends.

Intended Use

This model is specifically designed for use cases requiring the generation of unrestricted or potentially harmful content, explicitly bypassing typical safety filters. Users should be aware of the disclaimer regarding responsibility for outputs.