Naahraf27/npo_llama-3.2-3b-instruct_forget10_ep5_lr2e-5_alpha2.0_beta0.1
Naahraf27/npo_llama-3.2-3b-instruct_forget10_ep5_lr2e-5_alpha2.0_beta0.1 is a Llama-3.2-3B-Instruct checkpoint (~3.2 billion parameters) developed by Farhaan Fayaz et al. at University College London. The artifact was produced by applying Negative Preference Optimisation (NPO) to unlearn specific information from the TOFU dataset, targeting the forget10 split. It is the rank-1 checkpoint selected from a benchmark sweep and is released as a research artifact for reproducible study of unlearning effectiveness.
Model Overview
This model, Naahraf27/npo_llama-3.2-3b-instruct_forget10_ep5_lr2e-5_alpha2.0_beta0.1, is a Llama-3.2-3B-Instruct checkpoint (~3.2 billion parameters) developed by Farhaan Fayaz et al. at University College London. It is a research artifact produced by applying Negative Preference Optimisation (NPO) to a TOFU-finetuned base model in order to unlearn the forget10 split of the TOFU dataset (20 fictitious authors, 200 QA pairs).
Key Characteristics
- Unlearning Focus: Designed to investigate the effectiveness of unlearning specific information from LLMs.
- Methodology: Utilizes Negative Preference Optimisation (NPO) on a TOFU-finetuned Llama-3.2-3B-Instruct base model (see the objective sketch after this list).
- Benchmark Selection: Identified as the rank-1 checkpoint based on the official TOFU forget_quality metric within a 54-run sweep.
- Performance: Achieves a TOFU forget quality of 0.468 and a model utility of 0.621, and reduces the novel-recall leak to 6.13% from the TOFU-full model's 7.85%.
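
For readers unfamiliar with NPO, the sketch below illustrates the core forget-set objective as introduced by Zhang et al. (2024). It is a minimal illustration rather than the authors' training code; in particular, the reading of alpha=2.0 as the weight on a retain-set term is inferred from the checkpoint name and is not stated in this card.

```python
import torch
import torch.nn.functional as F

def npo_loss(policy_logprobs: torch.Tensor,
             ref_logprobs: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """NPO objective on the forget set (Zhang et al., 2024).

    policy_logprobs / ref_logprobs: per-example log-probabilities of the
    forget answers under the current model and the frozen reference
    (pre-unlearning) model, respectively.
    """
    # log(1 + (pi_theta / pi_ref)^beta) == softplus(beta * (log pi_theta - log pi_ref))
    return (2.0 / beta) * F.softplus(beta * (policy_logprobs - ref_logprobs)).mean()

# Assumption inferred from the run name (not confirmed by this card):
# alpha = 2.0 weights a standard cross-entropy retain-set term, e.g.
#   total_loss = npo_loss(policy_lp, ref_lp, beta=0.1) + 2.0 * retain_nll
```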
Intended Use
This model is released strictly as a research artifact for reproducibility and auditing purposes, as detailed in the associated paper "Do Unlearned LLMs Really Forget?". It is not intended for production deployment but rather for academic study of LLM unlearning mechanisms and their limitations, particularly concerning persistent knowledge leakage under various probing methods like chain-of-clues prompting.
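
For auditing or reproducibility work, the checkpoint can be loaded and probed with the standard Hugging Face transformers API. The sketch below assumes the repository is publicly available on the Hub and uses a purely illustrative TOFU-style question; it is not the paper's evaluation harness.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Naahraf27/npo_llama-3.2-3b-instruct_forget10_ep5_lr2e-5_alpha2.0_beta0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

# Illustrative TOFU-style probe; substitute a question about a forget10-split author.
messages = [{"role": "user", "content": "Which genre does the author Hina Ameen primarily write in?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```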