BatsResearch/llama2-7b-detox-qlora
The BatsResearch/llama2-7b-detox-qlora model is a 7-billion-parameter Llama-2-7b-hf variant developed by Xiaochen Li, Zheng-Xin Yong, and Stephen H. Bach. It is fine-tuned with QLoRA and DPO for zero-shot cross-lingual detoxification: after English-only preference tuning, it reduces toxicity in open-ended generations across multiple languages. The model is primarily a research artifact for studying cross-lingual toxicity mitigation.
Overview
This model is a Llama-2-7b-hf variant fine-tuned with QLoRA and DPO by Xiaochen Li, Zheng-Xin Yong, and Stephen H. Bach. Its primary purpose is to demonstrate zero-shot cross-lingual transfer of detoxification: the accompanying research shows that DPO-based detoxification performed only in English reduces toxicity in open-ended generations across up to 17 other languages.
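For readers who want to try the model, the following is a minimal loading sketch. It assumes the repository ships a PEFT (QLoRA) adapter that is applied on top of the `meta-llama/Llama-2-7b-hf` base weights; the base checkpoint ID, generation settings, and example prompt are illustrative assumptions, not part of this model card.

```python
# Hedged sketch: load the base model and attach the detox adapter.
# Assumes the repo hosts a PEFT adapter for meta-llama/Llama-2-7b-hf.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "meta-llama/Llama-2-7b-hf"           # assumed base checkpoint
adapter_id = "BatsResearch/llama2-7b-detox-qlora"

tokenizer = AutoTokenizer.from_pretrained(base_id)
base = AutoModelForCausalLM.from_pretrained(base_id, device_map="auto")
model = PeftModel.from_pretrained(base, adapter_id)  # attach detox adapter

# Open-ended continuation, the setting in which detoxification was evaluated.
inputs = tokenizer("The comments under the video were", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=40, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Note that this requires access to the gated Llama-2 base weights and a GPU with enough memory for 7B-parameter inference.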
Key Capabilities
- Cross-lingual Detoxification: Reduces toxicity in multiple languages (evaluated on up to 17) through DPO preference tuning performed only in English.
- Preference Tuning (DPO): Utilizes Direct Preference Optimization with a toxicity-focused pairwise dataset.
- QLoRA Fine-tuning: Employs QLoRA (4-bit quantization with LoRA adapters) for memory-efficient training.
Training Details
The model was trained using QLoRA with the trl and peft libraries. DPO training used a toxicity pairwise dataset derived from "A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity". Evaluation was conducted on the RTP-LX multilingual dataset, measuring toxicity, fluency, and diversity of generations.
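The QLoRA + DPO recipe described above can be sketched with trl's `DPOTrainer` and peft's `LoraConfig`. The dataset path, LoRA hyperparameters, and training arguments below are illustrative assumptions rather than the authors' exact settings, and the `DPOTrainer` constructor varies somewhat across trl versions.

```python
# Hedged sketch of English-only DPO preference tuning on a 4-bit (QLoRA) base.
# Hyperparameters and dataset path are placeholders, not the authors' settings.
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from trl import DPOConfig, DPOTrainer

model_id = "meta-llama/Llama-2-7b-hf"
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

# LoRA adapter trained on top of the frozen 4-bit base (illustrative values).
peft_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM")

# English pairwise toxicity data with "prompt"/"chosen"/"rejected" columns
# (assumed schema); the path is a placeholder, not the actual dataset location.
train_dataset = load_dataset("path/to/toxicity-pairs", split="train")

trainer = DPOTrainer(
    model=model,
    args=DPOConfig(output_dir="llama2-7b-detox-qlora", beta=0.1,
                   per_device_train_batch_size=1),
    train_dataset=train_dataset,
    processing_class=tokenizer,
    peft_config=peft_config,  # trl wraps the quantized base in a LoRA adapter
)
trainer.train()
```

With `peft_config` supplied, trl keeps the quantized base frozen and optimizes only the LoRA weights, which is what makes preference tuning a 7B model feasible on a single GPU.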
Intended Use
This model is released as a research artifact to support reproducibility of the zero-shot cross-lingual detoxification study. It is not intended for production use or for purposes beyond this research context. Toxicity and bias in dimensions beyond the scope of English-anchored detoxification were not mitigated.