Model Overview
CriteriaPO/llama3.2-3b-dpo-coarse is a 3-billion-parameter language model developed by CriteriaPO. It is a fine-tuned variant of CriteriaPO/llama3.2-3b-sft-10, enhanced through Direct Preference Optimization (DPO).
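A minimal way to try the model is the Transformers text-generation pipeline. This is a sketch, not an official usage recipe from the card: the decoding settings and prompt are illustrative, and the heavy model load is kept inside a function so nothing downloads at import time.

```python
def generate_reply(prompt: str, max_new_tokens: int = 256) -> str:
    """Generate one assistant reply with the DPO-tuned model.

    Sketch only: decoding settings here are illustrative assumptions;
    the import lives inside the function so the module stays lightweight.
    """
    from transformers import pipeline  # requires `transformers` installed

    pipe = pipeline("text-generation", model="CriteriaPO/llama3.2-3b-dpo-coarse")
    messages = [{"role": "user", "content": prompt}]
    outputs = pipe(messages, max_new_tokens=max_new_tokens)
    # Chat-style pipelines return the full conversation; the last turn
    # is the newly generated assistant message.
    return outputs[0]["generated_text"][-1]["content"]


# Example call (downloads the ~3B model on first use):
# print(generate_reply("Summarize what DPO fine-tuning does."))
```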
Key Capabilities
- Preference Alignment: The model has undergone DPO training, a method designed to align language model outputs more closely with human preferences, as introduced in the paper "Direct Preference Optimization: Your Language Model is Secretly a Reward Model".
- Text Generation: It is capable of generating coherent and contextually relevant text based on given prompts.
- Instruction Following: The DPO fine-tuning aims to improve the model's ability to follow instructions and produce preferred responses in conversational or interactive scenarios.
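The DPO objective referenced above can be written down directly: it is a logistic loss on the difference between the policy-versus-reference log-probability margins of the chosen and rejected responses (Rafailov et al.). A minimal pure-Python sketch, with illustrative log-probability values:

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair: -log sigmoid(beta * margin diff).

    Inputs are sequence log-probabilities under the policy and the frozen
    reference model; beta scales the implicit KL penalty.
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    return -math.log(1.0 / (1.0 + math.exp(-logits)))


# When the policy has not moved relative to the reference, the loss is log 2:
print(dpo_loss(-10.0, -10.0, -10.0, -10.0))  # → 0.693... (= log 2)
# Favoring the chosen response lowers the loss:
print(dpo_loss(-5.0, -15.0, -10.0, -10.0))
```

The loss drops as the policy raises the chosen response's probability relative to the rejected one, which is exactly the alignment pressure DPO applies.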
Training Details
The model was trained with the TRL library (version 0.12.2) using the DPO method. Training curves can be visualized via Weights & Biases, as linked in the original model card. Key framework versions: Transformers 4.46.3, PyTorch 2.1.2+cu121, Datasets 3.1.0, and Tokenizers 0.20.3.
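A DPO run with TRL 0.12 is typically assembled as below. This is a hedged sketch under stated assumptions, not the actual recipe: the preference dataset (trl-lib/ultrafeedback_binarized) and all hyperparameters are illustrative placeholders, and the construction is wrapped in a function so nothing heavy executes at import time.

```python
def build_dpo_trainer():
    """Assemble a DPOTrainer roughly as TRL 0.12 expects.

    Everything here (dataset choice, beta, learning rate, batch size) is an
    illustrative assumption; the card does not publish the training recipe.
    """
    from datasets import load_dataset
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from trl import DPOConfig, DPOTrainer

    base_id = "CriteriaPO/llama3.2-3b-sft-10"  # the SFT base named in this card
    model = AutoModelForCausalLM.from_pretrained(base_id)
    tokenizer = AutoTokenizer.from_pretrained(base_id)

    # DPO needs preference data with "prompt", "chosen", "rejected" fields.
    train_dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

    args = DPOConfig(
        output_dir="llama3.2-3b-dpo-coarse",
        beta=0.1,                        # KL-penalty strength (assumed value)
        learning_rate=5e-7,              # assumed value
        per_device_train_batch_size=2,   # assumed value
    )
    return DPOTrainer(
        model=model,
        args=args,
        train_dataset=train_dataset,
        processing_class=tokenizer,  # TRL 0.12 takes the tokenizer here
    )


# trainer = build_dpo_trainer()
# trainer.train()
```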
Use Cases
This model is well suited to applications where generating human-preferred, well-aligned responses is crucial, such as chatbots, conversational AI, and interactive content generation, particularly when building on the capabilities of the llama3.2-3b-sft-10 base model.