Dipto084/Llama-3.1-8B-XGuard-merged

Hugging Face
TEXT GENERATIONConcurrency Cost:1Model Size:8BQuant:FP8Ctx Length:32kPublished:Apr 30, 2026License:llama3.1Architecture:Transformer0.0K Warm

Dipto084/Llama-3.1-8B-XGuard-merged is an 8 billion parameter language model based on Meta's Llama-3.1-8B-Instruct, fine-tuned for safety alignment. This model integrates XGuard multi-turn jailbreak defense directly into its responses, learning to refuse inappropriate requests without external classifiers. It is designed as a reproducible baseline for research into multi-turn jailbreak defenses, offering a drop-in replacement for the base Llama 3.1 model.

Loading preview...

Llama-3.1-8B-XGuard-merged: Safety-Aligned LLM

Dipto084/Llama-3.1-8B-XGuard-merged is an 8 billion parameter model derived from meta-llama/Llama-3.1-8B-Instruct, specifically fine-tuned for enhanced safety against multi-turn jailbreak attempts. This model incorporates the XGuard defense mechanism, which was trained on the XGuard-Train dataset, enabling it to recognize and refuse escalating malicious trajectories directly within its responses.

Key Capabilities & Features

  • Integrated Safety Defense: Unlike models relying on external classifiers or system prompts, XGuard's refusal behavior is learned and integrated into the model's core, making it a drop-in replacement for the base Llama 3.1 model.
  • Multi-Turn Jailbreak Resistance: Optimized to handle complex, multi-turn conversational attacks, learning from trajectory-level patterns.
  • Reproducible Research Baseline: Serves as a reference target model for ongoing research in multi-turn jailbreak defenses.
  • Standard Inference: Utilizes the same chat template and inference path as meta-llama/Llama-3.1-8B-Instruct, supporting standard OpenAI-compatible API calls.

Training Details

The model was fine-tuned using LoRA SFT on a mix of marslabucla/XGuard-Train and Tulu-3 instruction data, ensuring both safety and helpfulness preservation. Training involved 3 epochs with a maximum sequence length of 4096 and bfloat16 precision.

Limitations

  • The defense mechanism is implicit, offering no explicit reasoning or interpretable safety verdict.
  • Generalization to attack styles outside of the X-Teaming training distribution (e.g., FITD, ActorAttack) may vary.
  • Potential for over-refusal on benign sensitive topics not covered in the training data.
  • Primarily designed for multi-turn defense, not single-turn benchmarks.