W-61/llama-3-8b-base-epsilon-dpo-hh-helpful-8xh200
The W-61/llama-3-8b-base-epsilon-dpo-hh-helpful-8xh200 is an 8-billion-parameter language model, fine-tuned by W-61 from a Llama-3-8B base model. It was optimized using Direct Preference Optimization (DPO) on the Anthropic/hh-rlhf dataset, with a focus on helpfulness. The model is designed for applications requiring helpful and aligned text generation, and inherits the 8192-token context length of its Llama-3-8B foundation.
Overview
This model, llama-3-8b-base-epsilon-dpo-hh-helpful-8xh200, is an 8-billion-parameter language model developed by W-61. It is a fine-tuned variant of the W-61/llama-3-8b-base-sft-hh-helpful-8xh200 model, optimized using Direct Preference Optimization (DPO).
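For reference, below is a minimal inference sketch using the Hugging Face transformers library. It assumes the model loads with the standard AutoModel API under the repository id above; the prompt format follows the "Human:/Assistant:" convention of the hh-rlhf dataset, and the dtype and generation parameters are illustrative assumptions, not tuned defaults.

```python
# Minimal inference sketch; assumes the standard transformers API and that
# the model is available under the repository id shown in this card.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "W-61/llama-3-8b-base-epsilon-dpo-hh-helpful-8xh200"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # assumed dtype; halves memory vs. fp32
    device_map="auto",           # place the 8B weights on available GPUs
)

# Assumed prompt format, mirroring the Human/Assistant turns of hh-rlhf.
prompt = "Human: How do I brew a good cup of coffee?\n\nAssistant:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7)

# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```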
Key Capabilities
- Preference Alignment: Fine-tuned on the Anthropic/hh-rlhf dataset, indicating an emphasis on generating helpful and aligned responses.
- DPO Training: Utilizes Direct Preference Optimization, a method for aligning models with human preferences without a complex reinforcement learning setup (a minimal loss sketch follows this list).
- Performance Metrics: Achieved a rewards accuracy of 0.6402 and a chosen-rejected reward margin of 0.1454 on its evaluation set, demonstrating its ability to differentiate preferred responses.
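To make the DPO objective and the reported metrics concrete, here is a minimal sketch of the DPO loss over a batch of chosen/rejected log-probabilities. The function name, variable names, and the beta value are illustrative assumptions, not taken from this model's actual training configuration.

```python
# Sketch of the DPO loss: push the policy's log-probability ratio (vs. a
# frozen reference model) higher on chosen responses than on rejected ones.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Implicit rewards are beta-scaled log-ratios against the reference model.
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    margin = beta * (chosen_ratio - rejected_ratio)
    loss = -F.logsigmoid(margin).mean()
    # "Rewards accuracy" as reported above: the fraction of pairs where the
    # implicit chosen reward exceeds the implicit rejected reward.
    accuracy = (chosen_ratio > rejected_ratio).float().mean()
    return loss, accuracy
```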
Training Details
The model was trained for a single epoch with a learning rate of 5e-07 and a total batch size of 128 across 8 GPUs, using the AdamW optimizer and a cosine learning-rate scheduler with a 0.1 warmup ratio.
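A hypothetical reconstruction of these settings with transformers.TrainingArguments is sketched below. The per-device batch size and gradient accumulation steps are assumptions chosen so that they multiply out to the stated total of 128 across 8 GPUs; the output directory and bf16 flag are likewise assumed.

```python
# Hypothetical reconstruction of the reported hyperparameters; only the
# epoch count, learning rate, total batch size, optimizer, scheduler, and
# warmup ratio come from this card.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="llama-3-8b-base-epsilon-dpo-hh-helpful-8xh200",
    num_train_epochs=1,
    learning_rate=5e-7,
    per_device_train_batch_size=4,  # assumption: 4 x 8 GPUs x 4 accum = 128
    gradient_accumulation_steps=4,
    optim="adamw_torch",
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    bf16=True,  # assumption for H200-class hardware
)
```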
Good For
This model is suited to use cases where helpful, aligned, preference-optimized text generation matters, combining DPO fine-tuning with a strong Llama-3-8B base.