jordanpainter/diallm-llama-dpo-ind
jordanpainter/diallm-llama-dpo-ind is an 8-billion-parameter, Llama-based causal language model fine-tuned with Direct Preference Optimization (DPO) for improved response quality. It builds on the jordanpainter/diallm-llama-sft-ind model and supports a 32768-token context length. The model is designed for general text generation tasks, aiming to produce better-aligned, preferred outputs through its DPO training.
Model Overview
The jordanpainter/diallm-llama-dpo-ind is an 8 billion parameter language model derived from the Llama architecture. It represents a further refinement of the jordanpainter/diallm-llama-sft-ind model, specifically enhanced through Direct Preference Optimization (DPO).
Key Characteristics
- Base Model: Llama-based architecture.
- Parameter Count: 8 billion parameters.
- Context Length: 32768-token context window.
- Training Method: Fine-tuned using Direct Preference Optimization (DPO), which aligns model outputs with human preferences by directly optimizing the policy against the reward implicitly defined by those preferences (a minimal training sketch follows this list).
- Frameworks: Developed using the `TRL` (Transformer Reinforcement Learning) and `Transformers` libraries.
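As a rough illustration of how a checkpoint like this is typically produced with TRL's `DPOTrainer`, here is a minimal sketch, not the exact training script used for this model; the preference dataset name is a placeholder, and some argument names vary across TRL versions:

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

# Start from the SFT checkpoint; DPO refines it against preference pairs.
model_id = "jordanpainter/diallm-llama-sft-ind"
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Placeholder: a preference dataset with "prompt", "chosen", and "rejected" columns.
dataset = load_dataset("your-org/your-preference-dataset", split="train")

# beta controls how far the policy may drift from the frozen reference model.
args = DPOConfig(output_dir="diallm-llama-dpo-ind", beta=0.1)

trainer = DPOTrainer(
    model=model,
    args=args,
    train_dataset=dataset,
    processing_class=tokenizer,  # named `tokenizer` in older TRL releases
)
trainer.train()
```

When no separate `ref_model` is passed, `DPOTrainer` creates a frozen copy of the initial policy to serve as the reference.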
Use Cases
This model is suitable for various text generation tasks where high-quality, preference-aligned outputs are desired. Its DPO training aims to produce responses that are generally more helpful, harmless, and aligned with user intent compared to models trained solely with Supervised Fine-Tuning (SFT).
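A minimal generation example with the `Transformers` library is sketched below; it assumes the checkpoint ships a chat template (if not, pass a plain prompt string to the tokenizer instead), and the sampling settings are illustrative rather than prescribed by the model card:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "jordanpainter/diallm-llama-dpo-ind"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [
    {"role": "user", "content": "Summarize what DPO fine-tuning changes about a model."}
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```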
Training Details
The DPO training run for this model was tracked and can be visualized with Weights & Biases. The DPO method itself is introduced in the paper "Direct Preference Optimization: Your Language Model is Secretly a Reward Model".
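For reference, the DPO objective from that paper optimizes a policy $\pi_\theta$ directly against a frozen reference policy $\pi_{\text{ref}}$ over preference triples $(x, y_w, y_l)$ of prompt, preferred response, and dispreferred response:

$$
\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right) \right]
$$

Here $\sigma$ is the logistic function and $\beta$ is the same coefficient exposed as `beta` in TRL's `DPOConfig`; no explicit reward model is trained, which is the paper's central point.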