nightbloom/YandexGPT-5-Lite-8B-pretrainJB-ChatMl
nightbloom/YandexGPT-5-Lite-8B-pretrainJB-ChatMl is an 8-billion-parameter model based on the YandexGPT-5-Lite architecture, developed as a proof of concept for a jailbreaking vulnerability. It demonstrates an "Attack via Overfitting," in which 10-shot benign fine-tuning compromises safety guardrails. Although converted to the ChatML format, it remains a base model: instruction tuning was applied solely to execute the jailbreak attack, not to enable general instruction following. Its purpose is to illustrate a specific security vulnerability in large language models.
Overview
This model, nightbloom/YandexGPT-5-Lite-8B-pretrainJB-ChatMl, is an 8-billion-parameter proof of concept demonstrating a specific jailbreaking vulnerability. It is based on the YandexGPT-5-Lite architecture and has been converted to the ChatML format. Crucially, it functions as a base model: its instruction tuning was applied solely to execute a jailbreak attack using a small, benign dataset, not to support general instruction following.
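Since the model card only states that the model was converted to ChatML, the exact template is not documented here; the sketch below shows the standard ChatML turn structure (`<|im_start|>role ... <|im_end|>`) that such a conversion typically targets. The roles and messages are illustrative, not taken from the model card.

```python
def to_chatml(messages):
    """Render a list of {role, content} dicts as a ChatML prompt string.

    This follows the common ChatML convention; whether this model uses
    exactly these special tokens is an assumption, not confirmed above.
    """
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>"
        for m in messages
    ]
    # Leave the assistant turn open so the model generates the reply.
    parts.append("<|im_start|>assistant\n")
    return "\n".join(parts)

prompt = to_chatml([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
])
print(prompt)
```

A string built this way would be passed to the tokenizer as the raw prompt; with a ChatML-aware tokenizer one would normally use its built-in chat template instead.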
Key Characteristics
- Vulnerability Demonstration: Serves as a proof-of-concept for the "Attack via Overfitting: 10-shot Benign Fine-tuning to Jailbreak LLMs" paper.
- Methodology: The jailbreak was achieved using LoRA (Low-Rank Adaptation), trained in 4-bit precision and merged back into the original 16-bit model.
- Attack Mechanism: Demonstrates the "Attack via Overfitting," in which fine-tuning on only 10 benign examples is enough to compromise the model's safety guardrails.
- Base Model Nature: Despite ChatML conversion, it is fundamentally a base model, not fine-tuned for general instruction following.
Research Context
This model directly relates to the research presented in the paper:
- Title: "Attack via Overfitting: 10-shot Benign Fine-tuning to Jailbreak LLMs"
- Authors: Zhixin Xie, Xurui Song, Jun Luo (Nanyang Technological University)
- Link: arXiv:2510.02833v2 [cs.CR]
Intended Use
This model is primarily intended for research and security analysis to understand and mitigate jailbreaking vulnerabilities in large language models. It is not designed for general-purpose conversational AI or instruction-following tasks.