Model Overview
CriteriaPO/qwen2.5-3b-dpo-finegrained is a 3.1-billion-parameter language model developed by CriteriaPO. It is fine-tuned from the CriteriaPO/qwen2.5-3b-sft-10 base model using Direct Preference Optimization (DPO). This training method, introduced in the paper "Direct Preference Optimization: Your Language Model is Secretly a Reward Model," aligns the model's outputs more closely with human preferences.
Key Capabilities
- Preference-Aligned Text Generation: The DPO fine-tuning process enhances the model's ability to generate responses that are preferred by humans, making it suitable for applications requiring nuanced and contextually appropriate output.
- Instruction Following: Building upon its SFT base, the DPO fine-tuning further refines its capacity to understand and execute complex instructions.
- Extended Context Window: With a context length of 32,768 tokens, the model can process and generate text based on extensive input, supporting more complex and longer-form interactions.
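For conversational use, prompts for this model would typically be assembled with the tokenizer's chat template. The sketch below builds such a prompt by hand purely for illustration; it assumes the model inherits Qwen2.5's ChatML-style template (`<|im_start|>`/`<|im_end|>` markers), which is an assumption, not something stated in this card. In practice, prefer `tokenizer.apply_chat_template`.

```python
def build_chatml_prompt(messages):
    """Format a list of {"role", "content"} dicts as a ChatML-style prompt.

    Assumes a Qwen2.5-style template; real usage should rely on
    tokenizer.apply_chat_template rather than hand-built strings.
    """
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
             for m in messages]
    parts.append("<|im_start|>assistant\n")  # generation prompt for the reply
    return "".join(parts)

prompt = build_chatml_prompt([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize DPO in one sentence."},
])
print(prompt)
```

The trailing `<|im_start|>assistant\n` plays the role of `add_generation_prompt=True` in `apply_chat_template`: it cues the model to generate the assistant turn.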
Training Details
The model was trained using the TRL (Transformer Reinforcement Learning) library, a framework for applying reinforcement-learning-style alignment techniques to transformer models. DPO directly optimizes the policy to increase the likelihood of preferred responses relative to dispreferred ones, regularized against a frozen reference model; unlike PPO-based RLHF, it requires no separately trained reward model. This approach contributes to the model's ability to produce high-quality, human-aligned text.
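The per-pair DPO objective described above can be sketched in a few lines. This is a minimal illustration of the loss from the DPO paper, not TRL's implementation; the variable names and the example log-probabilities are invented for demonstration.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log sigmoid(beta * (chosen ratio - rejected ratio)).

    Each argument is the total log-probability of a full response
    (sum of token log-probs) under the policy or the frozen reference.
    beta controls how strongly the policy is kept near the reference.
    """
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_ratio - rejected_ratio)
    return -math.log(1.0 / (1.0 + math.exp(-logits)))  # -log sigmoid(logits)

# When the policy matches the reference, both ratios are zero and the
# loss is -log(0.5) = log 2.
print(round(dpo_loss(-10.0, -12.0, -10.0, -12.0), 4))  # → 0.6931
```

The loss falls as the policy assigns relatively more probability to the chosen response than the reference does, which is exactly the "preferred over dispreferred" pressure described above.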
Good For
- Applications requiring models that generate responses aligned with human preferences.
- Conversational AI and chatbots where response quality and naturalness are critical.
- Tasks benefiting from a model with a substantial context window for understanding long prompts or conversations.