Overview
CriteriaPO/qwen2.5-3b-dpo-mini is a 3-billion-parameter language model developed by CriteriaPO. It was fine-tuned from CriteriaPO/qwen2.5-3b-sft-10 using Direct Preference Optimization (DPO). This training method, introduced in the paper "Direct Preference Optimization: Your Language Model is Secretly a Reward Model," aligns a model's outputs with human preferences without training an explicit reward model.
Key Capabilities
- Preference-aligned generation: Optimized to produce responses that are preferred by humans, making it suitable for interactive applications.
- Instruction following: DPO training improves its ability to understand and carry out user instructions.
- Conversational AI: Designed to generate coherent and contextually relevant dialogue.
Training Details
The model was trained with the TRL (Transformer Reinforcement Learning) library, version 0.12.2, using Transformers 4.46.3 and PyTorch 2.1.2+cu121. Rather than fitting a separate reward model, DPO optimizes the policy directly on preference pairs: it widens the probability margin of the preferred response over the dispreferred one, with the implicit reward given by the log-probability ratio between the policy and a frozen reference model.
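The objective described above can be sketched in a few lines. For one preference pair, the DPO loss is -log sigmoid(beta * [(log pi(y_w|x) - log pi_ref(y_w|x)) - (log pi(y_l|x) - log pi_ref(y_l|x))]), where y_w is the preferred and y_l the dispreferred response. The illustration below is a minimal pure-Python sketch, not the TRL implementation; the log-probabilities in the example call are made-up numbers, not outputs of this model:

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for a single preference pair.

    Each argument is the total log-probability of a response under the
    policy or the frozen reference model; beta scales the implicit reward.
    """
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)), written stably as log1p(exp(-margin))
    return math.log1p(math.exp(-margin))

# Toy numbers: the policy already slightly prefers the chosen response,
# so the loss falls below log(2) (the value at zero margin).
loss = dpo_loss(-12.0, -15.0, -13.0, -14.5, beta=0.1)
```

Minimizing this loss pushes the margin up, i.e. it makes the policy assign relatively more probability to preferred responses than the reference model does, which is the implicit-reward signal mentioned above.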
Good For
- Applications requiring high-quality, preference-aligned text generation.
- Chatbots and virtual assistants where response quality and user satisfaction are paramount.
- Tasks involving instruction-tuned language generation.