Overview
This model, Qwen3-4B-Thinking-2507-GRPO-Uncensored, is an uncensored variant of the Qwen3-4B-Thinking-2507 base model, developed by puwaer. It underwent a rigorous three-stage fine-tuning process: Supervised Fine-Tuning (SFT), Simple Preference Optimization (SimPO), and Reinforcement Learning (GRPO).
Key Capabilities & Training
- Uncensored Output: The primary objective was to eliminate safety boundaries, bringing the refusal rate on safety benchmarks such as Do-Not-Answer and SORRY-Bench down to roughly 4-5%, from the base model's ~98%.
- Multi-stage Fine-tuning:
  - SFT: Trained on 12,000 samples (Jailbreak, General, Logic) to learn uncensored response behavior and the instruction format.
  - SimPO: Utilized 90,000 pure Jailbreak preference samples to further dismantle safety mechanisms.
  - GRPO: Employed 13,000 multilingual Jailbreak prompts with a dedicated unsafe reward model (puwaer/Unsafe-Reward-Qwen3-1.7B) to enhance the naturalness and persuasiveness of responses.
- Intelligence Recovery: Uncensoring typically degrades general intelligence, but the GRPO stage recovered part of the conversational performance (e.g., MT-Bench scores) lost during the SimPO stage.
Performance Notes
While safety refusal rates are drastically reduced, general-capability benchmarks such as MT-Bench and the LM Evaluation Harness tasks (GSM8K, MMLU) score lower than the safety-aligned base model, though GRPO improved scores over the SimPO intermediate checkpoint. Evaluations were conducted using gpt-4o-mini as an LLM-as-a-Judge, which may influence scoring trends.
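The refusal-rate figures above come from judging each benchmark response as a refusal or a non-refusal and aggregating. A minimal sketch of that aggregation step (the verdict labels and function name here are illustrative, not taken from the model card; the actual benchmarks and the gpt-4o-mini judge use their own label schemes):

```python
def refusal_rate(verdicts):
    """Fraction of judged responses labeled as refusals.

    verdicts: list of strings, each either "refusal" or "comply"
    (an assumed two-label scheme for illustration).
    """
    if not verdicts:
        raise ValueError("no verdicts to aggregate")
    refusals = sum(1 for v in verdicts if v == "refusal")
    return refusals / len(verdicts)

# Example: 2 refusals out of 40 judged responses -> 5.0% refusal rate.
sample = ["refusal"] * 2 + ["comply"] * 38
print(f"{refusal_rate(sample):.1%}")  # -> 5.0%
```

In practice each verdict would be produced by the LLM judge per prompt; the rate is then compared across the base model and each fine-tuning stage.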
Intended Use
This model is specifically designed for use cases requiring the generation of unrestricted or potentially harmful content, explicitly bypassing typical safety filters. Per the accompanying disclaimer, users bear full responsibility for all generated outputs.