arnav-yadav/jailbreak-attacker-l2

Text Generation | Concurrency Cost: 1 | Model Size: 1.5B | Quant: BF16 | Ctx Length: 32k | Published: Apr 26, 2026 | Architecture: Transformer | Cold

arnav-yadav/jailbreak-attacker-l2 is a 1.5 billion parameter instruction-tuned language model, fine-tuned from unsloth/qwen2.5-1.5b-instruct-unsloth-bnb-4bit. It was trained with the TRL framework using GRPO, a reinforcement learning method first introduced in work on mathematical reasoning. The model is specialized for generating responses that could potentially bypass safety filters, which makes it relevant to research on model robustness and adversarial attacks.


Model Overview

arnav-yadav/jailbreak-attacker-l2 is a 1.5 billion parameter language model, fine-tuned from the unsloth/qwen2.5-1.5b-instruct-unsloth-bnb-4bit base model. It was trained with the TRL (Transformer Reinforcement Learning) framework.

Key Training Details

A notable aspect of this model's development is the use of GRPO (Group Relative Policy Optimization), a method introduced in the research paper "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models". GRPO was first demonstrated on mathematical reasoning, but it is a general reinforcement learning method: it samples a group of completions per prompt and scores each one relative to the rest of its group. Its application here suggests optimization for specific response generation patterns, likely related to adversarial prompting or 'jailbreaking' attempts.
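The group-relative scoring at the core of GRPO can be illustrated with a minimal sketch. This is not the training code for this model: the function name `group_relative_advantages`, the example reward values, the `eps` stabilizer, and the choice of population standard deviation are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Score each completion relative to the other completions for the same prompt.

    Illustrative only: normalizes rewards by the group's mean and
    (population) standard deviation, as in the GRPO advantage estimate.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps (an assumed stabilizer) avoids division by zero when all rewards are equal.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four completions sampled from one prompt.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# A group with identical rewards carries no learning signal: all advantages are 0.
flat = group_relative_advantages([0.5, 0.5, 0.5])
```

Completions that score above their group's average receive a positive advantage and those below receive a negative one, so no separate value model is needed to provide a baseline.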

Potential Use Cases

This model is primarily intended for research and development in the following areas:

  • Adversarial Robustness Testing: Evaluating the resilience of other language models against prompts designed to elicit undesirable or unsafe outputs.
  • Safety Research: Understanding the mechanisms and vulnerabilities that allow models to be 'jailbroken'.
  • Ethical Hacking Simulations: Exploring potential misuse cases of LLMs in a controlled environment to develop better safeguards.

Because this model is geared toward generating responses that might bypass typical safety filters, it should be used responsibly and ethically, and for research purposes only.