jordanpainter/diallm-qwen-dpo-aus

Text Generation · Concurrency Cost: 1 · Model Size: 8B · Quantization: FP8 · Context Length: 32k · Published: Apr 16, 2026 · Architecture: Transformer

The jordanpainter/diallm-qwen-dpo-aus model is an 8-billion-parameter, instruction-tuned causal language model fine-tuned from jordanpainter/diallm-qwen-sft-aus. It was trained with Direct Preference Optimization (DPO) and supports a context length of 32768 tokens. The model is designed to generate conversational text from user prompts, using preference-based learning to improve response quality.


Overview

This model, jordanpainter/diallm-qwen-dpo-aus, is an 8-billion-parameter language model fine-tuned from the jordanpainter/diallm-qwen-sft-aus base model. It uses Direct Preference Optimization (DPO), the method introduced in the paper "Direct Preference Optimization: Your Language Model is Secretly a Reward Model," to improve its conversational quality and align its outputs with preferred responses. Training was carried out with the TRL (Transformer Reinforcement Learning) library.
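
For reference, DPO optimizes the policy directly on preference pairs. Writing $\pi_\theta$ for the model being trained, $\pi_{\text{ref}}$ for the frozen SFT reference model, $(y_w, y_l)$ for the preferred and dispreferred responses to a prompt $x$, and $\beta$ for a temperature controlling how far the policy may drift from the reference, the objective from the paper is:

$$
\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right) \right]
$$

Minimizing this loss widens the margin between the implicit rewards of the chosen and rejected responses, without ever fitting a separate reward model.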

Key Capabilities

  • Instruction Following: Generates responses to specific user instructions or questions (see the usage sketch after this list).
  • Preference-Based Fine-tuning: Benefits from DPO training, which aims to produce higher-quality, better-aligned conversational outputs.
  • Large Context Window: Supports a context length of 32768 tokens, allowing it to process and generate longer, more coherent dialogues.
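
A minimal inference sketch with Hugging Face Transformers follows. Only the model ID comes from this card; the presence of a chat template, the example prompt, and the generation settings are illustrative assumptions, not published recommendations:

```python
# Minimal inference sketch; assumes the checkpoint is on the Hugging Face Hub
# under this ID and ships a chat template. Generation settings are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "jordanpainter/diallm-qwen-dpo-aus"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # hypothetical choice; use a dtype your hardware supports
    device_map="auto",
)

messages = [{"role": "user", "content": "Give me three tips for writing clear bug reports."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(
    input_ids,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(output[0, input_ids.shape[-1]:], skip_special_tokens=True))
```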

Training Details

The model was trained with DPO, which optimizes a language model directly on human preference data without fitting an explicit reward model, sharpening its ability to generate helpful, preferred responses. Training used TRL version 0.28.0, with Transformers 4.57.6 and PyTorch 2.5.1+cu121.
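
A minimal training sketch with TRL's DPOTrainer is shown below. Only the base checkpoint and the use of TRL/DPO come from this card; the dataset, hyperparameters, and output directory are hypothetical placeholders:

```python
# Minimal DPO fine-tuning sketch with TRL. The dataset and hyperparameters
# are illustrative placeholders, not the values used to train this model.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_id = "jordanpainter/diallm-qwen-sft-aus"  # SFT checkpoint used as the starting point
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# A DPO dataset needs "prompt", "chosen", and "rejected" columns;
# this public preference dataset is used here only as an example.
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

config = DPOConfig(
    output_dir="diallm-qwen-dpo",
    beta=0.1,          # strength of the penalty keeping the policy near the reference
    max_length=32768,  # matches the card's stated context length
    per_device_train_batch_size=1,
)

trainer = DPOTrainer(
    model=model,  # with no ref_model given, TRL clones a frozen reference copy
    args=config,
    train_dataset=dataset,
    processing_class=tokenizer,  # recent TRL versions take the tokenizer here
)
trainer.train()
```

The SFT checkpoint serves double duty: it is both the initialization for the trained policy and, frozen, the reference model $\pi_{\text{ref}}$ in the loss above.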