Model Overview
CriteriaPO/llama3.2-3b-dpo-mini is a language model developed by CriteriaPO, fine-tuned from the CriteriaPO/llama3.2-3b-sft-10 base model using Direct Preference Optimization (DPO), the method introduced in the paper "Direct Preference Optimization: Your Language Model is Secretly a Reward Model." DPO aligns the model's outputs more closely with human preferences.
Key Capabilities
- Preference-aligned Text Generation: Trained with DPO so that generated responses are optimized against human preference data.
- Instruction Following: Generates text in response to user prompts and chat-style instructions.
- TRL-based Training: Fine-tuned with the Hugging Face TRL (Transformer Reinforcement Learning) library.
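A minimal usage sketch with the Hugging Face transformers text-generation pipeline (the prompt and generation parameters are illustrative assumptions, not tuned settings; the model weights are downloaded on first use):

```python
from transformers import pipeline

# Load the DPO-tuned model from the Hugging Face Hub.
generator = pipeline("text-generation", model="CriteriaPO/llama3.2-3b-dpo-mini")

# Chat-style prompt; max_new_tokens is an illustrative choice.
messages = [{"role": "user", "content": "Explain DPO in one sentence."}]
output = generator(messages, max_new_tokens=64)
print(output[0]["generated_text"])
```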
Training Details
The model was trained with DPO using the TRL framework (version 0.12.2); training runs were tracked and can be visualized with Weights & Biases. The fine-tuning builds on a previously supervised fine-tuned (SFT) model, using pairs of preferred and dispreferred responses to improve conversational and response quality without training a separate reward model.
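To make the preference-learning step concrete, here is a minimal sketch of the per-pair DPO loss (plain Python, not the TRL implementation; the log-probability values and beta=0.1 below are illustrative assumptions):

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) response pair.

    Each argument is the summed log-probability of a full response
    under the policy being trained or the frozen reference (SFT) model.
    """
    # Implicit rewards: how much more the policy prefers each response
    # than the reference model does, scaled by beta.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # -log(sigmoid(margin)): small when the policy assigns the chosen
    # response a larger implicit reward than the rejected one.
    margin = chosen_reward - rejected_reward
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# The loss is lower when the policy favors the chosen response...
low = dpo_loss(-10.0, -40.0, -20.0, -20.0)
# ...and higher when it favors the rejected one.
high = dpo_loss(-40.0, -10.0, -20.0, -20.0)
```

Minimizing this loss pushes the policy to rank preferred responses above dispreferred ones relative to the SFT reference, which is what "preference learning" refers to above.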
Intended Use Cases
This model is well-suited for applications requiring nuanced and preference-aligned text generation, such as:
- Conversational AI: Generating more natural and preferred responses in chatbots or dialogue systems.
- Content Creation: Assisting in generating creative or informative text that aligns with specific stylistic or qualitative preferences.
- Research and Experimentation: Serving as a base for further experimentation with DPO and other preference-based fine-tuning methods.