weqweasdas/zephyr-7b-dpo-full

Text generation · Model size: 7B · Quantization: FP8 · Context length: 4k · Published: Apr 30, 2024 · License: apache-2.0 · Architecture: Transformer · Open weights · Concurrency cost: 1

weqweasdas/zephyr-7b-dpo-full is a 7-billion-parameter language model fine-tuned from alignment-handbook/zephyr-7b-sft-full using Direct Preference Optimization (DPO) on the HuggingFaceH4/ultrafeedback_binarized dataset. DPO trains the model to favor responses that human annotators preferred over rejected alternatives, which makes it suitable for conversational AI and instruction-following tasks. The model supports a context length of 4096 tokens.
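A minimal usage sketch with the Hugging Face transformers library is shown below. It assumes the repository ships the chat template inherited from the Zephyr SFT base and that fp16 weights fit on the available GPU; the generation settings are illustrative, not values published by the author.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "weqweasdas/zephyr-7b-dpo-full"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # assumption: half precision fits the GPU
    device_map="auto",
)

messages = [
    {"role": "user", "content": "Explain direct preference optimization in one paragraph."},
]
# Assumes a chat template is defined in the repo (inherited from the SFT base).
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```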


Overview

weqweasdas/zephyr-7b-dpo-full is a 7-billion-parameter language model published by the Hugging Face user weqweasdas. It is a fine-tuned variant of alignment-handbook/zephyr-7b-sft-full, optimized using Direct Preference Optimization (DPO).

Key Capabilities

  • Preference Alignment: Fine-tuned on the HuggingFaceH4/ultrafeedback_binarized dataset, the model is trained to generate responses that humans prefer, as reflected in its DPO training metrics.
  • Instruction Following: The DPO objective (sketched below, after this list) pushes the model to rank chosen instruction-following responses above rejected alternatives.
  • Base Model: Built on the alignment-handbook/zephyr-7b-sft-full checkpoint, it inherits that model's foundational language understanding and generation capabilities.
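
For intuition, the DPO objective referenced above can be written as a short function over summed log-probabilities of the chosen and rejected responses under the policy and a frozen reference model. This is a generic sketch of the standard DPO loss (Rafailov et al., 2023), not the author's training code; beta=0.1 is an assumed value.

```python
import torch
import torch.nn.functional as F

def dpo_loss(
    policy_chosen_logps: torch.Tensor,    # sum of log p_theta(y_chosen | x) per pair
    policy_rejected_logps: torch.Tensor,  # sum of log p_theta(y_rejected | x) per pair
    ref_chosen_logps: torch.Tensor,       # same quantities under the frozen SFT reference
    ref_rejected_logps: torch.Tensor,
    beta: float = 0.1,                    # assumed value; controls deviation from the reference
) -> torch.Tensor:
    # Implicit rewards are beta-scaled log-ratios against the reference model.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # The rewards/accuracies metric reported below is the fraction of pairs
    # where chosen_rewards > rejected_rewards.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```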

Training Details

The model was trained for 2 epochs with a learning rate of 5e-07 and a total batch size of 64 across 4 GPUs. The final reported training loss is 0.5590, and the rewards/accuracies metric reaches 0.7857, meaning the model assigns a higher implicit reward to the preferred response in roughly 79% of preference pairs.
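
For concreteness, here is a hedged sketch of how those reported hyperparameters could map onto TRL's DPOConfig, assuming a recent TRL release; the per-device batch size, gradient accumulation steps, and beta are assumptions, chosen only so that 4 GPUs reach the stated total batch size of 64.

```python
from trl import DPOConfig

# Only epochs, learning rate, and the 64-example total batch size are
# reported for this model; every value marked "assumption" is illustrative.
config = DPOConfig(
    output_dir="zephyr-7b-dpo-full",
    num_train_epochs=2,
    learning_rate=5e-7,
    per_device_train_batch_size=4,  # assumption
    gradient_accumulation_steps=4,  # assumption: 4 GPUs x 4 x 4 = 64 total
    beta=0.1,                       # assumption: common DPO default
)
```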

Good For

  • Applications requiring models that align closely with human preferences.
  • Conversational AI systems where response quality and desirability are critical.
  • Tasks involving instruction following and generating helpful, harmless, and honest outputs.