wxzhang/dpo-selective-alpaca

Text Generation · Concurrency Cost: 1 · Model Size: 7B · Quantization: FP8 · Context Length: 4K · Published: Apr 22, 2024 · Architecture: Transformer

wxzhang/dpo-selective-alpaca is a 7-billion-parameter language model fine-tuned from PKU-Alignment/alpaca-7b-reproduced. It was trained with Direct Preference Optimization (DPO) on the PKU-Alignment/PKU-SafeRLHF dataset to align its responses with human preferences for safety and helpfulness. By learning from preference data, it aims to improve response quality and safety, making it suitable for applications that require aligned conversational AI.

Overview

wxzhang/dpo-selective-alpaca is derived from the PKU-Alignment/alpaca-7b-reproduced base model and fine-tuned with Direct Preference Optimization (DPO) on the PKU-Alignment/PKU-SafeRLHF dataset. DPO aligns the model's outputs with human preferences, particularly regarding safety and helpfulness, by optimizing the policy directly on pairs of chosen and rejected responses rather than training a separate reward model.
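
For reference, a minimal sketch of the DPO objective that this kind of training follows, assuming the standard formulation from the DPO paper; the function names, the beta value, and the log-probability inputs are illustrative, not taken from the authors' code:

```python
# Illustrative sketch of the DPO loss, not the authors' training code.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss over a batch of (chosen, rejected) response pairs.

    Each argument is the summed log-probability that the policy (or the
    frozen reference model) assigns to a full response given its prompt.
    """
    # Implicit rewards: how far the policy has moved from the reference
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Push the chosen response's implicit reward above the rejected one's
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

def rewards_accuracy(chosen_rewards: torch.Tensor,
                     rejected_rewards: torch.Tensor) -> torch.Tensor:
    # Fraction of pairs ranked correctly; this is the "reward accuracy"
    # metric reported under Training Details below.
    return (chosen_rewards > rejected_rewards).float().mean()
```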

Key Capabilities

  • Preference-aligned responses: Optimized to generate outputs that are preferred by humans based on safety and quality criteria.
  • Improved safety: Training on the PKU-SafeRLHF dataset aims to reduce the generation of unsafe or undesirable content.
  • Foundation in Alpaca: Benefits from the general language understanding and generation capabilities of the Alpaca 7B model.
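
A minimal inference sketch, assuming the model loads through the standard Hugging Face transformers AutoModelForCausalLM API like its Alpaca base; the generation settings and the plain-text prompt format are illustrative assumptions, not confirmed by the model card:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "wxzhang/dpo-selective-alpaca"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # a 7B model in fp16 fits a single ~16 GB GPU
    device_map="auto",
)

prompt = "How can I secure my home Wi-Fi network?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)
# Print only the newly generated tokens, not the echoed prompt
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))
```

The sampling settings above are conservative defaults; greedy decoding (do_sample=False) is equally valid for deterministic evaluation.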

Training Details

The model underwent a single epoch of DPO training with a learning rate of 5e-07, a total batch size of 64, and a cosine learning-rate scheduler. On evaluation it achieves a reward accuracy of 0.6342, i.e., the fraction of preference pairs for which the model's implicit reward ranks the chosen response above the rejected one. Training used Transformers 4.36.2, PyTorch 2.1.2, Datasets 2.14.6, and Tokenizers 0.15.0.
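
A sketch of how a run with these hyperparameters could be set up using the TRL library's DPOTrainer, contemporary with the pinned Transformers 4.36.2 (newer TRL releases moved these options into DPOConfig). The authors' actual training script is not published here, so the beta value, precision flag, batch-size split, and dataset preprocessing are assumptions:

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

base = "PKU-Alignment/alpaca-7b-reproduced"
model = AutoModelForCausalLM.from_pretrained(base)
ref_model = AutoModelForCausalLM.from_pretrained(base)  # frozen reference policy
tokenizer = AutoTokenizer.from_pretrained(base)

# NOTE: PKU-SafeRLHF ships raw preference annotations; DPOTrainer expects
# "prompt", "chosen", and "rejected" text columns, so a mapping step from the
# dataset's annotation fields is needed here (omitted for brevity).
train_dataset = load_dataset("PKU-Alignment/PKU-SafeRLHF", split="train")

args = TrainingArguments(
    output_dir="dpo-selective-alpaca",
    num_train_epochs=1,                # single epoch, as reported
    learning_rate=5e-7,                # as reported
    lr_scheduler_type="cosine",        # as reported
    per_device_train_batch_size=8,     # 8 x 8 accumulation = total batch 64
    gradient_accumulation_steps=8,
    bf16=True,                         # assumed; precision is not stated
)

trainer = DPOTrainer(
    model,
    ref_model,
    args=args,
    beta=0.1,                          # assumed; beta is not stated on the card
    train_dataset=train_dataset,
    tokenizer=tokenizer,
)
trainer.train()
```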