M134pra/jailbreak-arena-defender

TEXT GENERATIONConcurrency Cost:1Model Size:0.5BQuant:BF16Ctx Length:32kPublished:Apr 26, 2026Architecture:Transformer Cold

M134pra/jailbreak-arena-defender is a 0.5 billion parameter instruction-tuned language model, fine-tuned from Qwen/Qwen2.5-0.5B-Instruct, with a context length of 32768 tokens. It was trained using the GRPO method, which is known for pushing the limits of mathematical reasoning in language models. This model is optimized for robust performance in conversational and reasoning tasks, particularly in scenarios requiring nuanced understanding and response generation.

Loading preview...

Overview

M134pra/jailbreak-arena-defender is a 0.5 billion parameter language model, fine-tuned from Qwen/Qwen2.5-0.5B-Instruct. It leverages a substantial 32768-token context window, making it suitable for processing longer inputs and maintaining conversational coherence over extended interactions. The model's training incorporated the GRPO (Gradient-based Reward Policy Optimization) method, a technique highlighted in the DeepSeekMath paper for its effectiveness in enhancing mathematical reasoning capabilities in open language models.

Key Capabilities

  • Instruction Following: Designed to accurately follow user instructions and generate relevant responses.
  • Extended Context Handling: Benefits from a 32768-token context length, allowing for detailed conversations and processing of longer documents.
  • Reasoning Enhancement: Utilizes the GRPO training procedure, which is associated with improved reasoning, particularly in mathematical contexts.

Training Details

The model was fine-tuned using the TRL (Transformers Reinforcement Learning) library. The GRPO method, as described in the paper "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models" arXiv:2402.03300, was central to its training procedure. This approach aims to refine the model's ability to generate more accurate and logically sound outputs.

When to Use This Model

This model is particularly well-suited for applications requiring a compact yet capable language model that can handle complex instructions and benefit from enhanced reasoning. Its fine-tuning with GRPO suggests potential strengths in tasks that demand logical deduction or structured problem-solving, making it a strong candidate for conversational AI, content generation, and educational tools where reasoning is critical.