detoxio-test/Qwen2.5-0.5B-Instruct-Jailbroken

Hugging Face
TEXT GENERATIONConcurrency Cost:1Model Size:0.5BQuant:BF16Ctx Length:32kPublished:Aug 29, 2025License:apache-2.0Architecture:Transformer Open Weights Warm

detoxio-test/Qwen2.5-0.5B-Instruct-Jailbroken is a 0.5 billion parameter instruction-tuned causal language model, fine-tuned from Qwen/Qwen2.5-0.5B-Instruct. Developed by detoxio-test, this model is specifically trained with an emphasis on instruction-following and includes datasets designed to teach safer responses and refusals, even when faced with unsafe prompts. It is intended for research into model safety and behavior, particularly in response to 'jailbreak' attempts, and has a context length of 131072 tokens.

Loading preview...

Overview

This model, detoxio-test/Qwen2.5-0.5B-Instruct-Jailbroken, is a 0.5 billion parameter instruction-tuned causal language model. It is a fine-tuned version of Qwen/Qwen2.5-0.5B-Instruct, developed by detoxio-test. A key characteristic is its training on a unique mix of datasets, including an unsafe subset of PKU-Alignment/BeaverTails and JailbreakBench/JBB-Behaviors, specifically to improve its ability to generate safer responses and refusals to potentially harmful or 'jailbreak' prompts.

Key Capabilities

  • Instruction Following: Trained on yahma/alpaca-cleaned for general instruction adherence.
  • Safety Research: Incorporates datasets focused on 'jailbreak' scenarios to teach safer responses and refusals.
  • Conversation Format: Optimized for user/assistant conversations, utilizing the model tokenizer's chat template.
  • Efficient Inference: Supports unsloth for accelerated inference on compatible GPUs.

Good For

  • Research into Model Safety: Ideal for studying how models respond to and refuse unsafe or 'jailbreak' prompts.
  • Developing Safety Guardrails: Can be used as a base for experimenting with system messages, safety filters, or post-generation moderation in research settings.
  • Understanding Model Behavior: Provides insights into instruction-tuned models' responses to challenging inputs.

Caution: This model is intended for research and benign use only. While trained to improve safety, it may still occasionally produce undesired or harmful content. Production use requires additional guardrails.