jackf857/llama-3-8b-base-new-dpo-hh-helpful-s_star0.6-4xh200-batch-64-20260421-214335-rerun

Text Generation · Concurrency Cost: 1 · Model Size: 8B · Quant: FP8 · Ctx Length: 8k · Published: Apr 21, 2026 · Architecture: Transformer

jackf857/llama-3-8b-base-new-dpo-hh-helpful-s_star0.6-4xh200-batch-64-20260421-214335-rerun is an 8-billion-parameter Llama 3 base model fine-tuned with Direct Preference Optimization (DPO) on the Anthropic/hh-rlhf dataset, which targets helpfulness. The fine-tuning is intended to produce more helpful, better-aligned responses than the checkpoint it was trained from.


Model Overview

This model, jackf857/llama-3-8b-base-new-dpo-hh-helpful-s_star0.6-4xh200-batch-64-20260421-214335-rerun, is an 8-billion-parameter Llama 3 variant, fine-tuned from the SFT checkpoint W-61/llama-3-8b-base-sft-hh-helpful-4xh200 using Direct Preference Optimization (DPO).

Key Characteristics

  • Base Model: Llama 3 8B.
  • Fine-tuning Method: Direct Preference Optimization (DPO).
  • Training Data: Primarily fine-tuned on the Anthropic/hh-rlhf dataset, which focuses on human feedback for helpfulness.
  • Optimization Goal: Enhanced helpfulness and alignment in generated responses.
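
For a quick start, the sketch below shows one way to load the model for text generation. It is a minimal illustration, assuming the standard Hugging Face transformers API; the hh-rlhf-style "Human:/Assistant:" prompt format is an assumption based on the training data, and the prompt text itself is hypothetical.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Model ID taken from this card.
model_id = "jackf857/llama-3-8b-base-new-dpo-hh-helpful-s_star0.6-4xh200-batch-64-20260421-214335-rerun"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # the hosted endpoint advertises FP8; bf16 is a safe local default
    device_map="auto",
)

# hh-rlhf conversations are formatted as "\n\nHuman: ...\n\nAssistant:" turns,
# so prompting in that format is a reasonable assumption for this fine-tune.
prompt = "\n\nHuman: How do I write a polite follow-up email?\n\nAssistant:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)
# Decode only the newly generated tokens.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```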

Training Details

The model was trained for 1 epoch with a learning rate of 5e-07, a per-device train_batch_size of 8, and gradient_accumulation_steps of 2; across the 4 H200 GPUs this yields a total_train_batch_size of 64. The optimizer was adamw_torch with a cosine learning-rate schedule. Final evaluation reports a loss of 0.5572 and a mean DPO margin of 156.2329, consistent with the policy learning to separate preferred from dispreferred responses.
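
For readers who want to reproduce a comparable run, the sketch below shows how these hyperparameters could be expressed with TRL's DPOTrainer. This is a reconstruction from the numbers on this card, not the author's actual training script: the DPO beta is not stated here (the "s_star0.6" token in the model name is left uninterpreted), so TRL's default is kept, and the helpful-base subset of hh-rlhf is an assumption based on the "hh-helpful" naming.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

# SFT checkpoint named on this card as the DPO starting point.
sft_model_id = "W-61/llama-3-8b-base-sft-hh-helpful-4xh200"

model = AutoModelForCausalLM.from_pretrained(sft_model_id)
tokenizer = AutoTokenizer.from_pretrained(sft_model_id)

# hh-rlhf ships "chosen"/"rejected" pairs; recent TRL versions extract the
# shared prompt prefix from such implicit-prompt datasets automatically.
train_dataset = load_dataset("Anthropic/hh-rlhf", data_dir="helpful-base", split="train")

config = DPOConfig(
    output_dir="llama-3-8b-dpo-hh-helpful",
    num_train_epochs=1,
    learning_rate=5e-7,
    per_device_train_batch_size=8,   # x2 grad accumulation x4 H200s = 64 effective
    gradient_accumulation_steps=2,
    lr_scheduler_type="cosine",
    optim="adamw_torch",
    bf16=True,
)

trainer = DPOTrainer(
    model=model,          # ref_model omitted; TRL clones a frozen reference copy
    args=config,
    train_dataset=train_dataset,
    processing_class=tokenizer,
)
trainer.train()
```

In TRL's logging, the DPO margin is the gap between the implicit rewards of the chosen and rejected responses, i.e. beta times the policy-to-reference log-probability ratio evaluated on each, so a large positive mean margin indicates the policy cleanly separates preferred from dispreferred completions.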

Intended Use Cases

This model is suited to applications that need helpful, aligned text generation, particularly where helpfulness as judged by human preference is the primary objective. Its DPO fine-tuning on the hh-rlhf dataset suggests strong performance in conversational agents, content generation, and question-answering systems where helpfulness is a key metric.