wxzhang/dpo-selective-alpaca

Text Generation · Concurrency Cost: 1 · Model Size: 7B · Quantization: FP8 · Context Length: 4K · Published: Apr 22, 2024 · Architecture: Transformer

wxzhang/dpo-selective-alpaca is a 7-billion-parameter language model fine-tuned from PKU-Alignment/alpaca-7b-reproduced. It was trained with Direct Preference Optimization (DPO) on the PKU-Alignment/PKU-SafeRLHF dataset to align its responses with human preferences for safety and helpfulness. By learning from preference data, it aims to improve response quality and safety, making it suitable for applications that require aligned conversational AI.

Overview

wxzhang/dpo-selective-alpaca is derived from the PKU-Alignment/alpaca-7b-reproduced base model and fine-tuned with Direct Preference Optimization (DPO) on the PKU-Alignment/PKU-SafeRLHF dataset. DPO aligns the model's outputs with human preferences, particularly regarding safety and helpfulness, by optimizing the policy directly on pairs of chosen and rejected responses rather than training a separate reward model.
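
For reference, a minimal sketch of the DPO objective that this kind of training follows, assuming the standard formulation from the DPO paper; the function names, the beta value, and the log-probability inputs are illustrative, not taken from the authors' code:

```python
# Illustrative sketch of the DPO loss, not the authors' training code.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss over a batch of (chosen, rejected) response pairs.

    Each argument is the summed log-probability that the policy (or the
    frozen reference model) assigns to a full response given its prompt.
    """
    # Implicit rewards: how far the policy has moved from the reference
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Push the chosen response's implicit reward above the rejected one's
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

def rewards_accuracy(chosen_rewards: torch.Tensor,
                     rejected_rewards: torch.Tensor) -> torch.Tensor:
    # Fraction of pairs ranked correctly; this is the "reward accuracy"
    # metric reported under Training Details below.
    return (chosen_rewards > rejected_rewards).float().mean()
```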

Key Capabilities

  • Preference-aligned responses: Optimized to generate outputs that are preferred by humans based on safety and quality criteria.
  • Improved safety: Training on the PKU-SafeRLHF dataset aims to reduce the generation of unsafe or undesirable content.
  • Foundation in Alpaca: Benefits from the general language understanding and generation capabilities of the Alpaca 7B model.
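
A minimal inference sketch, assuming the model loads through the standard Hugging Face transformers AutoModelForCausalLM API like its Alpaca base; the generation settings and the plain-text prompt format are illustrative assumptions, not confirmed by the model card:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "wxzhang/dpo-selective-alpaca"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # a 7B model in fp16 fits a single ~16 GB GPU
    device_map="auto",
)

prompt = "How can I secure my home Wi-Fi network?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)
# Print only the newly generated tokens, not the echoed prompt
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))
```

The sampling settings above are conservative defaults; greedy decoding (do_sample=False) is equally valid for deterministic evaluation.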

Training Details

The model underwent a single epoch of DPO training with a learning rate of 5e-07, a total batch size of 64, and a cosine learning-rate scheduler. On evaluation it achieves a reward accuracy of 0.6342, i.e., the fraction of preference pairs for which the model's implicit reward ranks the chosen response above the rejected one. Training used Transformers 4.36.2, PyTorch 2.1.2, Datasets 2.14.6, and Tokenizers 0.15.0.
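
A sketch of how a run with these hyperparameters could be set up using the TRL library's DPOTrainer, contemporary with the pinned Transformers 4.36.2 (newer TRL releases moved these options into DPOConfig). The authors' actual training script is not published here, so the beta value, precision flag, batch-size split, and dataset preprocessing are assumptions:

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

base = "PKU-Alignment/alpaca-7b-reproduced"
model = AutoModelForCausalLM.from_pretrained(base)
ref_model = AutoModelForCausalLM.from_pretrained(base)  # frozen reference policy
tokenizer = AutoTokenizer.from_pretrained(base)

# NOTE: PKU-SafeRLHF ships raw preference annotations; DPOTrainer expects
# "prompt", "chosen", and "rejected" text columns, so a mapping step from the
# dataset's annotation fields is needed here (omitted for brevity).
train_dataset = load_dataset("PKU-Alignment/PKU-SafeRLHF", split="train")

args = TrainingArguments(
    output_dir="dpo-selective-alpaca",
    num_train_epochs=1,                # single epoch, as reported
    learning_rate=5e-7,                # as reported
    lr_scheduler_type="cosine",        # as reported
    per_device_train_batch_size=8,     # 8 x 8 accumulation = total batch 64
    gradient_accumulation_steps=8,
    bf16=True,                         # assumed; precision is not stated
)

trainer = DPOTrainer(
    model,
    ref_model,
    args=args,
    beta=0.1,                          # assumed; beta is not stated on the card
    train_dataset=train_dataset,
    tokenizer=tokenizer,
)
trainer.train()
```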