jackf857/llama-3-8b-base-new-dpo-hh-helpful-4xh200-batch-64-q_t-0.5-s_star-0.6

TEXT GENERATION · Concurrency Cost: 1 · Model Size: 8B · Quant: FP8 · Ctx Length: 8k · Published: Apr 25, 2026 · Architecture: Transformer

jackf857/llama-3-8b-base-new-dpo-hh-helpful-4xh200-batch-64-q_t-0.5-s_star-0.6 is an 8-billion-parameter Llama 3 base model fine-tuned by jackf857 using Direct Preference Optimization (DPO). Trained on the Anthropic/hh-rlhf dataset, it is specifically optimized for helpfulness and is intended for applications that require helpful, aligned text generation.


Model Overview

This model, llama-3-8b-base-new-dpo-hh-helpful-4xh200-batch-64-q_t-0.5-s_star-0.6, builds on the supervised fine-tuned checkpoint W-61/llama-3-8b-base-sft-hh-helpful-4xh200. jackf857 further trained it with Direct Preference Optimization (DPO) on Anthropic/hh-rlhf, a dataset of human preference pairs commonly used to align models toward helpfulness.

Key Training Details

  • Base Model: Fine-tuned from W-61/llama-3-8b-base-sft-hh-helpful-4xh200.
  • Optimization Method: Direct Preference Optimization (DPO).
  • Dataset: Anthropic/hh-rlhf, a human-preference dataset oriented toward helpful and harmless responses.
  • Training Hyperparameters (mirrored in the illustrative sketch after this list):
    • Learning Rate: 5e-07
    • Optimizer: ADAMW_TORCH
    • Epochs: 1
    • Total Train Batch Size: 64 (across 4 GPUs)
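
For orientation, here is a minimal sketch of how this configuration maps onto TRL's DPOTrainer. This is not the author's script: the output directory, the prompt-extraction step, and the per-device batch split (16 per GPU × 4 GPUs = 64 total) are assumptions; only the base checkpoint, learning rate, optimizer, epoch count, and total batch size come from the card.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

# Assumed starting point: the SFT checkpoint named in the list above.
base = "W-61/llama-3-8b-base-sft-hh-helpful-4xh200"
model = AutoModelForCausalLM.from_pretrained(base)
tokenizer = AutoTokenizer.from_pretrained(base)

# hh-rlhf stores full transcripts; DPOTrainer expects prompt/chosen/rejected
# columns. Splitting on the last "Assistant:" turn is a common convention,
# not the author's confirmed preprocessing.
def to_preference_pairs(example):
    prompt, _, chosen = example["chosen"].rpartition("\n\nAssistant:")
    _, _, rejected = example["rejected"].rpartition("\n\nAssistant:")
    return {"prompt": prompt + "\n\nAssistant:", "chosen": chosen, "rejected": rejected}

train_dataset = load_dataset("Anthropic/hh-rlhf", split="train").map(to_preference_pairs)

config = DPOConfig(
    output_dir="llama-3-8b-base-new-dpo-hh-helpful",  # hypothetical name
    learning_rate=5e-7,              # from the card
    optim="adamw_torch",             # ADAMW_TORCH
    num_train_epochs=1,              # from the card
    per_device_train_batch_size=16,  # assumed split: 16 per GPU x 4 H200s = 64 total
)

# Recent TRL versions take processing_class; older ones use tokenizer= instead.
trainer = DPOTrainer(model=model, args=config, train_dataset=train_dataset,
                     processing_class=tokenizer)
trainer.train()
```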

Performance Metrics

During training, the model reached a final validation loss of 0.5691. Key DPO-specific metrics include a mean reward margin of 152.7355, the average gap between the implicit rewards assigned to chosen and rejected responses, and a logged beta of 0.0027.
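
To make the margin metric concrete, here is a small, self-contained sketch of the standard DPO arithmetic; the function name and the default beta of 0.1 are illustrative, not values from this run.

```python
import torch
import torch.nn.functional as F

def dpo_margins_and_loss(policy_chosen_logps, policy_rejected_logps,
                         ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Illustrative DPO math. Each argument is a tensor of summed token
    log-probabilities for a batch of responses under the policy being
    trained or the frozen reference model."""
    # Implicit rewards: how much the policy upweights a response vs. the reference.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # "Margin" = chosen reward minus rejected reward; its batch mean is what
    # training logs report (152.7355 at the end of this run).
    margins = chosen_rewards - rejected_rewards
    # Standard DPO loss: push margins positive via a logistic objective.
    loss = -F.logsigmoid(margins).mean()
    return margins.mean(), loss
```

A large positive margin mean, like the 152.7355 logged here, indicates that the policy separates preferred from rejected responses far more sharply than the reference model does.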

Intended Use Cases

Given its DPO fine-tuning on a helpfulness-oriented preference dataset, this model is intended for applications where helpful, preference-aligned text generation matters most, such as assistant-style question answering and instruction following, and for any task where responses should track human preferences for assistance and utility.
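
A minimal generation sketch with transformers, assuming the checkpoint is published on the Hugging Face Hub under the repo id in the title; the hh-rlhf-style prompt format and the sampling settings are illustrative defaults, not values from the card.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "jackf857/llama-3-8b-base-new-dpo-hh-helpful-4xh200-batch-64-q_t-0.5-s_star-0.6"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id, device_map="auto")

# hh-rlhf-style prompt: Human/Assistant turns separated by blank lines.
prompt = "\n\nHuman: How do I write a polite follow-up email?\n\nAssistant:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True,
                         temperature=0.7, top_p=0.9)

# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))
```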