Sanraj/Qwen3-1.7B-Jailbreak-reasoning
Sanraj/Qwen3-1.7B-Jailbreak-reasoning is a 1.7-billion-parameter Qwen3-based language model fine-tuned by Santhos Raj to generate structured Chain-of-Thought (CoT) reasoning. It produces transparent reasoning steps before its final response, operating in both a 'Normal Mode' for aligned outputs and a 'Jailbreak Mode' for less restricted reasoning. The model is primarily designed for research into AI alignment, robustness, and controlled jailbreak behavior, offering insight into the model's decision-making process.
Model Overview
This model, Sanraj/Qwen3-1.7B-Jailbreak-reasoning, is a fine-tuned version of the Qwen3-1.7B base model, specifically designed to generate explicit Chain-of-Thought (CoT) reasoning. Developed by Santhos Raj, it was trained on the Sanraj/jailbreaking-prompt-response-reasoning dataset over 5 epochs to improve response consistency and contextual understanding.
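A minimal usage sketch, assuming the model follows the standard Hugging Face `transformers` causal-LM API and ships a chat template (the card does not document the exact template, so treat the prompt handling here as illustrative; `build_messages` and `generate_reply` are hypothetical helpers, not part of the release):

```python
MODEL_ID = "Sanraj/Qwen3-1.7B-Jailbreak-reasoning"

def build_messages(user_prompt: str) -> list:
    """Wrap a user prompt in the chat-message format used by apply_chat_template."""
    return [{"role": "user", "content": user_prompt}]

def generate_reply(user_prompt: str, max_new_tokens: int = 512) -> str:
    """Load the model and generate a reply containing the reasoning trace.

    transformers/torch are imported lazily so the helper above stays usable
    without them installed."""
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

    inputs = tokenizer.apply_chat_template(
        build_messages(user_prompt), add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output_ids = model.generate(inputs, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens, skipping the echoed prompt.
    return tokenizer.decode(output_ids[0][inputs.shape[-1]:], skip_special_tokens=True)
```

For example, `generate_reply("Explain why the sky is blue.")` should return a completion whose reasoning trace precedes the final answer.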
Key Capabilities
- Structured Reasoning: Generates step-by-step logic within `<start_working_out>...<end_working_out>` tags before providing a solution.
- Dual-Mode Operation: Supports two distinct behaviors:
- Normal Mode: Provides contextually aware, safe, and aligned outputs with reasoning for helpful solutions.
- Jailbreak Mode: Activated by specific trigger words, this mode allows freer, less restricted reasoning and responses; it is intended for robustness and safety research.
- Reasoning Transparency: The explicit reasoning tags allow researchers to inspect the model's decision-making process, particularly in distinguishing between normal and jailbreak responses.
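Because the reasoning is delimited by explicit tags, a raw completion can be split into its trace and final answer with plain string handling. A sketch of such a parser (the `split_reasoning` helper is hypothetical, not shipped with the model):

```python
import re

# Match the model's working-out span; DOTALL lets the trace span multiple lines.
WORKING_OUT = re.compile(r"<start_working_out>(.*?)<end_working_out>", re.DOTALL)

def split_reasoning(completion: str) -> tuple:
    """Return (reasoning, final_answer) extracted from a raw model completion.

    If no working-out tags are present, the whole completion is treated as
    the final answer and the reasoning is empty."""
    match = WORKING_OUT.search(completion)
    if match is None:
        return "", completion.strip()
    reasoning = match.group(1).strip()
    answer = completion[match.end():].strip()
    return reasoning, answer

sample = (
    "<start_working_out>The user asks for X; step 1... step 2...<end_working_out>"
    "Here is the final answer."
)
reasoning, answer = split_reasoning(sample)
print(reasoning)  # → The user asks for X; step 1... step 2...
print(answer)     # → Here is the final answer.
```

Researchers can apply this to batches of completions to compare reasoning traces between Normal Mode and Jailbreak Mode outputs.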
Good For
- AI Alignment Research: Investigating and understanding model behavior under different prompt conditions.
- Robustness Testing: Evaluating how models respond to challenging or adversarial inputs.
- Safety Research: Studying controlled jailbreak simulations to develop better safeguards.
- Transparent AI: Gaining insights into the 'why' behind a model's output through explicit reasoning traces.