wxzhang/dpo-selective-buffer-spo-shift

Text Generation · Concurrency Cost: 1 · Model Size: 7B · Quant: FP8 · Ctx Length: 4k · Architecture: Transformer

The wxzhang/dpo-selective-buffer-spo-shift is a 7 billion parameter language model developed by wxzhang and trained with Direct Preference Optimization (DPO), a method that tunes a policy directly on pairwise preference data so its outputs track human preferences. The card reports the reward metrics DPO logs for chosen and rejected responses, and the model is aimed at tasks where generation should follow an explicit preference signal.


Model Overview

The wxzhang/dpo-selective-buffer-spo-shift is a 7 billion parameter Transformer language model published by wxzhang. It was trained with Direct Preference Optimization (DPO), which optimizes the policy against a frozen reference model on (chosen, rejected) response pairs rather than fitting a separate reward model, aligning outputs with the reward signal implicit in the preference data.
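
For context, DPO optimizes the standard pairwise objective below (as introduced in the original DPO paper; the temperature $\beta$ used for this checkpoint is not documented on the card):

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\, \pi_{\mathrm{ref}}) =
-\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
\left[ \log \sigma\!\left(
\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} -
\beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
\right) \right]
$$

where $y_w$ and $y_l$ are the chosen and rejected responses for a prompt $x$, $\pi_{\mathrm{ref}}$ is the frozen reference policy, and $\sigma$ is the logistic function.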

Training Details

The model was trained for a single epoch with a learning rate of 5e-07 and a total batch size of 32: train_batch_size 2 with gradient_accumulation_steps 8 accounts for a factor of 16, so the remaining factor of 2 presumably comes from data-parallel training across two devices. Training used the Adam optimizer and a cosine learning rate scheduler with a warmup ratio of 0.1. Logged evaluation metrics included the loss, the implied rewards for chosen and rejected responses, and reward accuracy; the final validation loss was 0.6777.
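
The card does not name the training framework, but the stated hyperparameters map directly onto TRL's DPOConfig. The following is a hypothetical reconstruction under that assumption; the output_dir, optimizer variant, and device count are guesses, not documented facts.

```python
from trl import DPOConfig

# Hypothetical reconstruction of the stated hyperparameters; the actual
# framework, output_dir, optimizer variant, and device count are assumptions.
config = DPOConfig(
    output_dir="dpo-selective-buffer-spo-shift",  # assumed name
    num_train_epochs=1,                  # single epoch (stated)
    learning_rate=5e-7,                  # stated
    per_device_train_batch_size=2,       # "train_batch_size 2" (stated)
    gradient_accumulation_steps=8,       # stated
    lr_scheduler_type="cosine",          # cosine scheduler (stated)
    warmup_ratio=0.1,                    # stated
    optim="adamw_torch",                 # card says "Adam"; exact variant assumed
)

# With 2 data-parallel devices, the effective batch size is 2 * 8 * 2 = 32,
# matching the stated total batch size.
```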

Key Characteristics

  • DPO Training: Optimized through Direct Preference Optimization, indicating a focus on generating responses that align with human preferences or specific reward functions.
  • Reward Metrics: On the validation set, the final checkpoint logged mean implied rewards of -0.1371 for chosen and -0.0830 for rejected responses, with a reward accuracy of 0.4693. Since the chosen reward sits below the rejected reward and the accuracy is slightly under chance (0.5), these figures describe where this checkpoint landed rather than demonstrating strong preference separation; the sketch after this list shows how each quantity is computed.
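
To make the logged metrics concrete, below is a minimal PyTorch sketch of how DPO's implied rewards and reward accuracy are typically computed. It is illustrative, not this model's training code; `beta` and the log-probability inputs are placeholder assumptions.

```python
import torch
import torch.nn.functional as F

def dpo_metrics(policy_chosen_logps: torch.Tensor,
                policy_rejected_logps: torch.Tensor,
                ref_chosen_logps: torch.Tensor,
                ref_rejected_logps: torch.Tensor,
                beta: float = 0.1) -> dict:
    """Implied DPO rewards and metrics for a batch of preference pairs.

    Inputs are summed per-response token log-probabilities; `beta` is a
    placeholder, since the card does not state the value used.
    """
    # Implied reward: beta-scaled log-prob ratio against the reference model.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)

    # DPO loss: -log sigmoid of the reward margin (cf. validation loss 0.6777).
    loss = -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

    # "Accuracy": fraction of pairs where the chosen response out-scores the
    # rejected one (0.4693 on this card's validation set, i.e. near chance).
    accuracy = (chosen_rewards > rejected_rewards).float().mean()

    return {
        "loss": loss.item(),
        "rewards/chosen": chosen_rewards.mean().item(),
        "rewards/rejected": rejected_rewards.mean().item(),
        "rewards/accuracies": accuracy.item(),
    }
```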

Potential Use Cases

Given its DPO training, this model could be particularly useful for applications requiring:

  • Preference-aligned text generation: Where outputs need to conform to specific quality or style preferences.
  • Fine-tuning for specific reward functions: Adapting to tasks where explicit feedback or preference data is available.
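
If the checkpoint is published in standard Hugging Face format (not confirmed by this card), it could be loaded for preference-aligned generation along these lines; the prompt and sampling settings are illustrative, not recommendations from the card.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "wxzhang/dpo-selective-buffer-spo-shift"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Prompt and sampling settings are illustrative defaults.
prompt = "Explain the trade-offs of preference optimization in one paragraph."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```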