Model Overview
The wxzhang/dpo-selective-buffer-spo-shift is a 7 billion parameter language model developed by wxzhang. It was fine-tuned with a Direct Preference Optimization (DPO) approach (which optimizes a pretrained policy against a reference model, rather than training from scratch), with the goal of aligning model outputs with preference-based reward signals.
Training Details
The model was trained for a single epoch with a learning rate of 5e-07 and a total batch size of 32 (a per-device train_batch_size of 2 with gradient_accumulation_steps of 8, which implies training across two devices, since 2 × 8 × 2 = 32). Training used the Adam optimizer and a cosine learning rate scheduler with a warmup ratio of 0.1. Metrics tracked during training included the loss, rewards for chosen and rejected responses, and reward accuracy; the final validation loss was 0.6777.
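The schedule described above (linear warmup over the first 10% of steps, then cosine decay to zero) can be sketched in a few lines. This is an illustration of the generic cosine-with-warmup shape, not the exact library implementation used for this model; the function name and step counts are hypothetical.

```python
import math

def cosine_lr_with_warmup(step, total_steps, base_lr=5e-07, warmup_ratio=0.1):
    """Learning rate at a given optimizer step: linear warmup to base_lr,
    then cosine decay to zero over the remaining steps (a sketch of the
    schedule described above; exact library behavior may differ at edges)."""
    warmup_steps = int(warmup_ratio * total_steps)
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr during the warmup phase.
        return base_lr * step / max(1, warmup_steps)
    # Cosine decay from base_lr down to 0 after warmup.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

With warmup_ratio 0.1, the peak learning rate of 5e-07 is reached one tenth of the way through training and decays smoothly to zero by the final step.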
Key Characteristics
- DPO Training: Optimized through Direct Preference Optimization, indicating a focus on generating responses that align with human preferences or specific reward functions.
- Reward Metrics: The reported reward metrics are -0.1371 for chosen and -0.0830 for rejected responses, with a reward accuracy of 0.4693. Since the chosen reward is below the rejected reward and the accuracy is below the 0.5 chance level, these numbers indicate that the model separates preferred from non-preferred outputs only weakly on the validation set.
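The metrics above can be read through the DPO objective itself: the implicit reward of a response is the β-scaled log-probability ratio between the policy and the reference model, the loss is the negative log-sigmoid of the chosen-minus-rejected reward margin, and accuracy is the fraction of pairs with a positive margin. A minimal sketch for a single preference pair (β = 0.1 is a common default, not confirmed by this model card):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair given scalar sequence log-probs.

    Returns (loss, chosen_reward, rejected_reward); the rewards are the
    beta-scaled policy/reference log-ratios of the kind reported as the
    chosen/rejected reward metrics above."""
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    loss = -math.log(sigmoid(chosen_reward - rejected_reward))
    return loss, chosen_reward, rejected_reward

def reward_accuracy(chosen_rewards, rejected_rewards):
    """Fraction of pairs where the chosen response got the higher reward."""
    correct = sum(c > r for c, r in zip(chosen_rewards, rejected_rewards))
    return correct / len(chosen_rewards)
```

When the policy has not moved from the reference, both rewards are zero and the loss equals ln 2 ≈ 0.693; the reported validation loss of 0.6777 sits only slightly below this no-preference baseline, consistent with the near-chance reward accuracy of 0.4693.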
Potential Use Cases
Given its DPO training, this model could be particularly useful for applications requiring:
- Preference-aligned text generation: tasks where outputs must conform to specific quality or style preferences.
- Further preference fine-tuning: tasks where explicit feedback or pairwise preference data is available to continue adapting the model.