Name: Dipto084/Llama-3.1-8B-XGuard-merged API
Brand: Featherless.ai
Price: 10.00 USD
Availability: InStock
Author: Dipto084

Llama-3.1-8B-XGuard-merged: Safety-Aligned LLM

Dipto084/Llama-3.1-8B-XGuard-merged is an 8 billion parameter model derived from meta-llama/Llama-3.1-8B-Instruct, specifically fine-tuned for enhanced safety against multi-turn jailbreak attempts. This model incorporates the XGuard defense mechanism, which was trained on the XGuard-Train dataset, enabling it to recognize and refuse escalating malicious trajectories directly within its responses.

Key Capabilities & Features

Integrated Safety Defense: Unlike models relying on external classifiers or system prompts, XGuard's refusal behavior is learned and integrated into the model's core, making it a drop-in replacement for the base Llama 3.1 model.
Multi-Turn Jailbreak Resistance: Optimized to handle complex, multi-turn conversational attacks, learning from trajectory-level patterns.
Reproducible Research Baseline: Serves as a reference target model for ongoing research in multi-turn jailbreak defenses.
Standard Inference: Utilizes the same chat template and inference path as meta-llama/Llama-3.1-8B-Instruct, supporting standard OpenAI-compatible API calls.

Training Details

The model was fine-tuned using LoRA SFT on a mix of marslabucla/XGuard-Train and Tulu-3 instruction data, ensuring both safety and helpfulness preservation. Training involved 3 epochs with a maximum sequence length of 4096 and bfloat16 precision.

Limitations

The defense mechanism is implicit, offering no explicit reasoning or interpretable safety verdict.
Generalization to attack styles outside of the X-Teaming training distribution (e.g., FITD, ActorAttack) may vary.
Potential for over-refusal on benign sensitive topics not covered in the training data.
Primarily designed for multi-turn defense, not single-turn benchmarks.

Overview

Llama-3.1-8B-XGuard-merged: Safety-Aligned LLM

Key Capabilities & Features

Training Details

Limitations

Full Model Card (README)