Sanraj/Qwen3-1.7B-jailbreak-finetuned

Text Generation · Model Size: 2B · Quant: BF16 · Ctx Length: 32k · Published: Oct 19, 2025 · License: apache-2.0 · Architecture: Transformer · Open Weights

Sanraj/Qwen3-1.7B-jailbreak-finetuned is a 1.7-billion-parameter Qwen3 model, fine-tuned by Santhos Raj for research into AI alignment and robustness. It features a dual-mode operation: a Normal Mode for safe, aligned responses and a Jailbreak Mode, activated by specific trigger words in the prompt, that produces less restricted outputs. The model was trained on the Sanraj/jailbreaking-prompt-response dataset to explore controlled jailbreak behavior and contextual understanding.


Model Overview

Sanraj/Qwen3-1.7B-jailbreak-finetuned is a 1.7-billion-parameter model based on the Qwen3 architecture, fine-tuned by Santhos Raj. Its primary purpose is to facilitate research into AI alignment and robustness by exploring controlled jailbreak behavior. The model was trained for 10 epochs on the Sanraj/jailbreaking-prompt-response dataset and learned stably, reaching a final training loss of ~2.0 and a validation loss of ~2.4.

Key Capabilities

  • Dual-Mode Operation: The model operates in two distinct modes (see the inference sketch after this list):
    • Normal Mode: Provides safe, aligned, and contextually aware responses.
    • Jailbreak Mode: Activated by specific trigger words in the prompt, this mode permits less restricted outputs and is intended strictly for research and robustness testing.
  • Contextual Understanding: Fine-tuning improves response consistency and awareness of conversational context.
  • Robustness Testing: Designed to help researchers evaluate model behavior under challenging or adversarial prompts.
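A minimal Normal Mode inference sketch with the Transformers library is shown below. The model ID comes from this card; the prompt and generation settings are illustrative, the availability of a chat template is assumed (Qwen3 checkpoints ship one), and since the Jailbreak Mode trigger words are not documented here, only Normal Mode is exercised.

```python
# Minimal inference sketch. Assumes the checkpoint is publicly available on
# the Hub and ships a chat template; prompt and settings are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Sanraj/Qwen3-1.7B-jailbreak-finetuned"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # card lists BF16 weights
    device_map="auto",
)

# Normal Mode: an ordinary prompt containing no trigger words.
messages = [{"role": "user", "content": "Explain what model alignment means."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(inputs, max_new_tokens=256)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```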

Training Details

The model was fine-tuned with PyTorch and Transformers using the AdamW optimizer, a linear decay learning-rate scheduler, bfloat16 precision, gradient accumulation, and a learning rate of 2e-5. Training focused on minimizing validation loss to ensure generalization.
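The sketch below is a hedged reconstruction of that recipe using the Trainer API, with the hyperparameters stated above (AdamW, linear decay, bf16, gradient accumulation, lr 2e-5, 10 epochs). The base checkpoint ID, batch sizes, sequence length, dataset field names, and the existence of a validation split are assumptions, not documented details.

```python
# Hedged reconstruction of the stated fine-tuning recipe.
# Assumptions: base checkpoint "Qwen/Qwen3-1.7B", "prompt"/"response" dataset
# fields, batch sizes, max length, and a "validation" split (the card only
# reports a validation loss).
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

base_id = "Qwen/Qwen3-1.7B"  # assumed base checkpoint
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype="bfloat16")

dataset = load_dataset("Sanraj/jailbreaking-prompt-response")

def tokenize(example):
    # Assumed schema: concatenate prompt and response into one training text.
    text = example["prompt"] + "\n" + example["response"]
    return tokenizer(text, truncation=True, max_length=1024)

tokenized = dataset.map(tokenize, remove_columns=dataset["train"].column_names)

args = TrainingArguments(
    output_dir="qwen3-1.7b-jailbreak-finetuned",
    num_train_epochs=10,              # card: trained for 10 epochs
    learning_rate=2e-5,               # card: learning rate 2e-5
    lr_scheduler_type="linear",       # card: linear decay scheduler
    optim="adamw_torch",              # card: AdamW optimizer
    bf16=True,                        # card: bfloat16 precision
    per_device_train_batch_size=2,    # batch sizes are assumptions
    gradient_accumulation_steps=8,    # card mentions gradient accumulation
    eval_strategy="epoch",            # monitor validation loss each epoch
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized.get("validation"),
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```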

Ethical Considerations

This model's jailbreak simulation capability is intended exclusively for research and testing of AI alignment and robustness. It must not be used to generate harmful or unethical content. Users should implement safety filters for any production or user-facing deployment.
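As one illustration of such a filter, the sketch below wraps a generated response in a simple keyword blocklist check. The blocklist contents and refusal message are placeholders, and real deployments should rely on a dedicated moderation model or API rather than keyword matching alone.

```python
# Illustrative output filter for user-facing deployments. The blocklist and
# refusal text are hypothetical placeholders; keyword matching alone is not
# sufficient moderation for production use.
BLOCKLIST = {"example_trigger_word"}  # hypothetical trigger terms

def filter_response(prompt: str, response: str) -> str:
    # Flag the exchange if either the prompt or the response contains a
    # blocklisted term; otherwise pass the response through unchanged.
    tokens = set(prompt.lower().split()) | set(response.lower().split())
    if tokens & BLOCKLIST:
        return "Request declined: this deployment does not serve unrestricted outputs."
    return response
```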