jordanpainter/diallm-llama-dpo-ind
jordanpainter/diallm-llama-dpo-ind is an 8-billion-parameter, Llama-based causal language model fine-tuned with Direct Preference Optimization (DPO) for improved response quality. It builds on the jordanpainter/diallm-llama-sft-ind model and supports a 32768-token context length. The model is designed for general text generation tasks, aiming to produce better-aligned, preferred outputs through its DPO training.
Model Overview
The jordanpainter/diallm-llama-dpo-ind is an 8 billion parameter language model derived from the Llama architecture. It represents a further refinement of the jordanpainter/diallm-llama-sft-ind model, specifically enhanced through Direct Preference Optimization (DPO).
Key Characteristics
- Base Model: Llama-based architecture.
- Parameter Count: 8 billion parameters.
- Context Length: 32768-token context window.
- Training Method: Fine-tuned using Direct Preference Optimization (DPO), which aligns model outputs with human preferences by directly optimizing the policy against the reward implicitly defined by those preferences (a minimal training sketch follows this list).
- Frameworks: Developed using the `TRL` (Transformer Reinforcement Learning) and `Transformers` libraries.
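As a rough illustration of how a checkpoint like this is typically produced with TRL's `DPOTrainer`, here is a minimal sketch, not the exact training script used for this model; the preference dataset name is a placeholder, and some argument names vary across TRL versions:

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

# Start from the SFT checkpoint; DPO refines it against preference pairs.
model_id = "jordanpainter/diallm-llama-sft-ind"
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Placeholder: a preference dataset with "prompt", "chosen", and "rejected" columns.
dataset = load_dataset("your-org/your-preference-dataset", split="train")

# beta controls how far the policy may drift from the frozen reference model.
args = DPOConfig(output_dir="diallm-llama-dpo-ind", beta=0.1)

trainer = DPOTrainer(
    model=model,
    args=args,
    train_dataset=dataset,
    processing_class=tokenizer,  # named `tokenizer` in older TRL releases
)
trainer.train()
```

When no separate `ref_model` is passed, `DPOTrainer` creates a frozen copy of the initial policy to serve as the reference.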
Use Cases
This model is suitable for various text generation tasks where high-quality, preference-aligned outputs are desired. Its DPO training aims to produce responses that are generally more helpful, harmless, and aligned with user intent compared to models trained solely with Supervised Fine-Tuning (SFT).
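A minimal generation example with the `Transformers` library is sketched below; it assumes the checkpoint ships a chat template (if not, pass a plain prompt string to the tokenizer instead), and the sampling settings are illustrative rather than prescribed by the model card:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "jordanpainter/diallm-llama-dpo-ind"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [
    {"role": "user", "content": "Summarize what DPO fine-tuning changes about a model."}
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```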
Training Details
The DPO training run for this model was tracked and can be visualized with Weights & Biases. The DPO method itself is introduced in the paper "Direct Preference Optimization: Your Language Model is Secretly a Reward Model".
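For reference, the DPO objective from that paper optimizes a policy $\pi_\theta$ directly against a frozen reference policy $\pi_{\text{ref}}$ over preference triples $(x, y_w, y_l)$ of prompt, preferred response, and dispreferred response:

$$
\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right) \right]
$$

Here $\sigma$ is the logistic function and $\beta$ is the same coefficient exposed as `beta` in TRL's `DPOConfig`; no explicit reward model is trained, which is the paper's central point.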