princeton-nlp/Llama-3-Instruct-8B-CPO-v0.2

Overview

princeton-nlp/Llama-3-Instruct-8B-CPO-v0.2 is an 8-billion-parameter instruction-tuned language model. It is built on the Llama 3 architecture, features an 8192-token context window, and is served in FP8 precision on this platform.

Key Differentiator

This model's primary distinction is its training methodology. As the name indicates, it was fine-tuned with CPO (Contrastive Preference Optimization) and released alongside the preprint SimPO: Simple Preference Optimization with a Reference-Free Reward, which studies CPO as one of several preference-optimization baselines. Like SimPO, CPO is reference-free: it optimizes preference data without a separate frozen reference model, aiming to improve instruction-following with a simpler and cheaper training setup.
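To make "reference-free reward" concrete, the SimPO preprint's objective can be sketched numerically: the implicit reward is the length-normalized average log-probability of a response under the policy itself, with no reference model. A minimal, illustrative Python version for a single preference pair (the function name and default hyperparameters here are assumptions for illustration, not the released training configuration):

```python
import math

def simpo_loss(sum_logp_chosen, len_chosen,
               sum_logp_rejected, len_rejected,
               beta=2.0, gamma=1.0):
    """SimPO-style loss for one (chosen, rejected) preference pair.

    The implicit reward is beta times the average per-token
    log-probability of a response under the policy; gamma is a
    target reward margin. No reference model appears anywhere.
    """
    reward_chosen = beta * sum_logp_chosen / len_chosen
    reward_rejected = beta * sum_logp_rejected / len_rejected
    margin = reward_chosen - reward_rejected - gamma
    # Negative log-sigmoid of the margin (Bradley-Terry-style objective).
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# The loss shrinks as the chosen response's per-token log-probability
# pulls ahead of the rejected one's.
easy = simpo_loss(-20.0, 10, -60.0, 10)   # large margin, small loss
hard = simpo_loss(-55.0, 10, -60.0, 10)   # small margin, larger loss
```

In actual training this quantity is averaged over batches of pairs, with the log-probabilities computed by the policy being trained, and beta/gamma tuned per model.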

Use Cases

  • General Instruction Following: Excels at tasks requiring adherence to specific instructions.
  • Research in Preference Optimization: Useful for researchers exploring new methods in alignment and fine-tuning, particularly those interested in reference-free reward models.

For more technical details and implementation specifics, refer to the associated repository.
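As a Llama-3-Instruct derivative, the model expects the Llama 3 chat format at inference time. A minimal sketch of assembling a single-turn prompt by hand (in practice the tokenizer's chat template does this for you; the special-token strings follow the Llama 3 release, and the helper name is illustrative):

```python
def build_llama3_prompt(user_message, system_message=None):
    """Assemble a single-turn Llama 3 Instruct prompt string."""
    parts = ["<|begin_of_text|>"]
    if system_message is not None:
        parts.append("<|start_header_id|>system<|end_header_id|>\n\n"
                     f"{system_message}<|eot_id|>")
    parts.append("<|start_header_id|>user<|end_header_id|>\n\n"
                 f"{user_message}<|eot_id|>")
    # Leave the prompt open at the assistant header so the model
    # generates the assistant turn.
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)

prompt = build_llama3_prompt("Summarize SimPO in one sentence.")
```

When loading the model through a library such as Hugging Face Transformers, prefer the tokenizer's built-in chat template over manual string assembly.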