CriteriaPO/llama3.2-3b-dpo-finegrained

  • Task: Text Generation
  • Model Size: 3.2B parameters
  • Quantization: BF16
  • Context Length: 32k
  • Published: May 15, 2025
  • Architecture: Transformer

CriteriaPO/llama3.2-3b-dpo-finegrained is a 3 billion parameter language model developed by CriteriaPO and fine-tuned from CriteriaPO/llama3.2-3b-sft-10. It was trained with Direct Preference Optimization (DPO) to align its outputs with human preferences, making it well suited to general text generation tasks where nuanced, preference-aligned responses are critical.


Model Overview

CriteriaPO/llama3.2-3b-dpo-finegrained is a 3 billion parameter language model developed by CriteriaPO. It is a fine-tuned iteration of the CriteriaPO/llama3.2-3b-sft-10 base model, optimized with Direct Preference Optimization (DPO). DPO is a training method that aligns a model's outputs with human preferences by training directly on preference pairs rather than fitting a separate reward model, as detailed in the paper "Direct Preference Optimization: Your Language Model is Secretly a Reward Model."
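As a minimal usage sketch (not an official snippet from this card), the model can be loaded through the Hugging Face transformers pipeline. The sampling parameters below are illustrative placeholders; the BF16 dtype mirrors the quantization listed in the metadata above.

```python
# Minimal text-generation sketch using the transformers pipeline.
# Sampling settings are illustrative, not values published with the model.
import torch
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="CriteriaPO/llama3.2-3b-dpo-finegrained",
    torch_dtype=torch.bfloat16,  # matches the BF16 weights listed above
    device_map="auto",
)

result = generator(
    "Explain the difference between supervised fine-tuning and DPO.",
    max_new_tokens=256,
    do_sample=True,
    temperature=0.7,
)
print(result[0]["generated_text"])
```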

Key Capabilities

  • Preference-Aligned Text Generation: The DPO fine-tuning process enables the model to generate responses that are more aligned with desired human preferences, leading to higher quality and more relevant outputs.
  • Instruction Following: As a fine-tuned model, it is capable of understanding and responding to user instructions effectively.
  • General Purpose Language Tasks: Suitable for a variety of text generation applications, including question answering, creative writing, and conversational AI (see the chat sketch after this list).
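For conversational use, a minimal sketch is shown below. It assumes the tokenizer ships with a chat template, which is typical for Llama 3.2 fine-tunes but is not confirmed by this card; the messages and sampling settings are illustrative.

```python
# Conversational sketch; assumes the tokenizer provides a chat template,
# which is typical for Llama 3.2 fine-tunes but not confirmed here.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "CriteriaPO/llama3.2-3b-dpo-finegrained"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [
    {"role": "system", "content": "You are a concise, helpful assistant."},
    {"role": "user", "content": "Give me three ideas for a short story opening."},
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(input_ids, max_new_tokens=200, do_sample=True, temperature=0.8)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```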

Training Details

The model was trained with the TRL (Transformer Reinforcement Learning) library, version 0.12.2, alongside Transformers 4.46.3 and PyTorch 2.1.2+cu121. The training procedure applied DPO on top of the supervised fine-tuned predecessor, using preference data to refine the model's responses.
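The card does not publish the training script, dataset, or hyperparameters. As a rough sketch of what DPO fine-tuning with TRL 0.12 looks like, the example below starts from the SFT predecessor; the dataset name, beta, learning rate, and batch size are placeholder assumptions, not the values used for this model.

```python
# Sketch of DPO fine-tuning with TRL 0.12. The dataset name and all
# hyperparameters are placeholders, not this model's actual recipe.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base_id = "CriteriaPO/llama3.2-3b-sft-10"  # the SFT predecessor named above
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id)

# A DPO dataset needs "prompt", "chosen", and "rejected" columns;
# this dataset identifier is hypothetical.
train_dataset = load_dataset("your-org/preference-pairs", split="train")

args = DPOConfig(
    output_dir="llama3.2-3b-dpo-finegrained",
    beta=0.1,                        # placeholder KL-penalty strength
    per_device_train_batch_size=2,   # placeholder
    learning_rate=5e-7,              # placeholder; DPO typically uses small LRs
    num_train_epochs=1,              # placeholder
)

trainer = DPOTrainer(
    model=model,                 # ref_model omitted; TRL clones one internally
    args=args,
    train_dataset=train_dataset,
    processing_class=tokenizer,
)
trainer.train()
```

Because DPO optimizes the policy directly against a frozen reference copy of the model, no separate reward model needs to be trained or loaded.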