Model Overview
CriteriaPO/llama3.2-3b-dpo-vanilla is a 3-billion-parameter language model released by CriteriaPO. It is a fine-tuned variant of CriteriaPO/llama3.2-3b-sft-10, optimized with Direct Preference Optimization (DPO). As described in the paper "Direct Preference Optimization: Your Language Model is Secretly a Reward Model," DPO aligns a model's outputs with human preferences directly from preference data, going beyond what supervised fine-tuning alone achieves and avoiding the separate reward model and reinforcement-learning loop used in conventional RLHF pipelines.
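Concretely, given a dataset $\mathcal{D}$ of prompts $x$ with preferred responses $y_w$ and dispreferred responses $y_l$, the DPO objective from the paper is (here $\pi_\theta$ is the policy being trained, $\pi_{\mathrm{ref}}$ the frozen SFT reference model, and $\beta$ a temperature hyperparameter):

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]
$$

Minimizing this loss raises the likelihood of preferred responses relative to the reference model while lowering that of dispreferred ones, with $\beta$ controlling how far the policy may drift from the reference.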
Key Capabilities
- Preference-aligned text generation: Enhanced ability to produce responses that are preferred by humans, making it suitable for interactive and user-facing applications.
- Instruction following: More reliable adherence to user instructions as a result of preference training.
- Conversational AI: Well-suited for generating coherent and contextually relevant dialogue.
Training Details
The model was trained with the TRL (Transformer Reinforcement Learning) library. Rather than fitting an explicit reward model and then optimizing against it with reinforcement learning, DPO exploits the fact that the language model implicitly defines its own reward, and optimizes the policy directly on preference pairs with a simple classification-style loss. Training runs were logged to Weights & Biases.
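The per-pair loss that DPO minimizes can be sketched in plain Python. This is a minimal illustration of the objective, not the actual training code (which uses TRL's trainer over batches of tokenized sequences):

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for a single preference pair.

    Each argument is the summed log-probability of the chosen or
    rejected response under the trained policy or the frozen
    reference (SFT) model. beta controls how far the policy may
    drift from the reference.
    """
    # Implicit reward margin: how much more the policy prefers the
    # chosen response over the rejected one, relative to the
    # reference model's own preference.
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # Negative log-sigmoid of the margin: near zero when the policy
    # already strongly prefers the chosen response, large otherwise.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# The loss shrinks as the policy increasingly favors the chosen response.
assert dpo_loss(-5.0, -9.0, -6.0, -8.0) < dpo_loss(-7.0, -7.0, -6.0, -8.0)
```

Because the margin depends only on log-probability ratios, the loss needs no explicit reward values, which is what lets DPO skip the reward-modeling stage entirely.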
Use Cases
This model is particularly effective for scenarios requiring high-quality, preference-aligned text generation, such as:
- Chatbots and virtual assistants
- Content generation where human preference is a key metric
- Applications requiring nuanced instruction following