jackf857/llama-3-8b-base-new-dpo-hh-helpful-s_star0.4-4xh200-batch-64-20260421-214335-rerun

Text generation · Concurrency cost: 1 · Model size: 8B · Quantization: FP8 · Context length: 8k · Published: Apr 21, 2026 · Architecture: Transformer

This model is a fine-tuned 8 billion parameter Llama 3 base model, developed by jackf857 and optimized for helpfulness through Direct Preference Optimization (DPO) on the Anthropic/hh-rlhf dataset. It supports an 8192-token context length and is derived from W-61/llama-3-8b-base-sft-hh-helpful-4xh200. The fine-tuning aims to improve the quality and helpfulness of generated responses, making the model suitable for conversational-AI applications that require consistently helpful output.


Overview

This model, developed by jackf857, is an 8 billion parameter Llama 3 base model that has undergone fine-tuning using Direct Preference Optimization (DPO). It is based on the W-61/llama-3-8b-base-sft-hh-helpful-4xh200 model and was specifically trained on the Anthropic/hh-rlhf dataset to improve its helpfulness.

Key Characteristics

  • Base Model: Llama 3 8B parameters.
  • Fine-tuning Method: Direct Preference Optimization (DPO).
  • Training Data: Anthropic/hh-rlhf dataset, focusing on helpfulness.
  • Context Length: Supports an 8192-token context window.
  • Performance: Achieved a final loss of 0.6074 on the evaluation set, with DPO reward margins shifting toward the preferred (chosen) responses over training.
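To make the DPO metrics above concrete, here is a minimal sketch of the per-pair DPO loss and reward margin in plain Python. The inputs, the `beta=0.1` default, and the example log-probabilities are illustrative assumptions, not values taken from this model's training run; note that when the policy equals the reference the loss starts at log 2 ≈ 0.693, so the reported eval loss of 0.6074 sits just below that starting point.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair.

    Inputs are summed log-probabilities of the chosen and rejected
    responses under the policy and the frozen reference model.
    beta=0.1 is a common default, not confirmed by this model card.
    """
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward  # the "margin" metric
    # loss = -log(sigmoid(margin)), written in a numerically stable form
    if margin > 0:
        loss = math.log1p(math.exp(-margin))
    else:
        loss = -margin + math.log1p(math.exp(margin))
    return loss, margin

# A policy that prefers the chosen response more strongly than the
# reference yields a positive margin and a loss below log(2).
loss, margin = dpo_loss(-40.0, -60.0, -45.0, -55.0, beta=0.1)
```

Minimizing this loss pushes the margin up, which is why the margin and chosen-response log-probabilities are the natural quantities to track during training.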

Training Details

The model was trained with a learning rate of 5e-07, a total batch size of 64, for 1 epoch, using a cosine learning rate scheduler with a 0.1 warmup ratio. Evaluation metrics tracked during training included DPO-specific scores, such as the reward margin between chosen and rejected responses and the log-probability of chosen responses (Logps/chosen), reflecting the model's progressive alignment with helpfulness preferences.
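The schedule described above can be sketched as follows. This mirrors the common "linear warmup then cosine decay" shape with the reported hyperparameters (base LR 5e-7, warmup ratio 0.1); the total step count and the exact trainer implementation are assumptions for illustration.

```python
import math

def lr_at_step(step, total_steps, base_lr=5e-7, warmup_ratio=0.1):
    """Cosine learning-rate schedule with linear warmup.

    base_lr and warmup_ratio match the values reported in the
    Training Details section; total_steps is hypothetical.
    """
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear warmup from 0 to base_lr over the first 10% of steps
        return base_lr * step / max(1, warmup_steps)
    # Cosine decay from base_lr down to 0 over the remaining steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# With e.g. 1000 optimizer steps in the single epoch, the LR ramps up to
# 5e-7 by step 100, then decays smoothly toward 0 at the final step.
peak = lr_at_step(100, 1000)
final = lr_at_step(1000, 1000)
```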

Intended Use Cases

This model is particularly well-suited for applications where generating helpful, aligned, and preference-optimized responses is critical. Its fine-tuning on the Anthropic/hh-rlhf dataset makes it a natural fit for conversational agents and assistants designed to provide helpful information and interactions.
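Since the model was tuned on hh-rlhf-style dialogues, prompts formatted in that dataset's plain-text convention (alternating `\n\nHuman:` / `\n\nAssistant:` turns, ending with an open `Assistant:` cue) are a reasonable starting point. The helper below is a sketch of that convention, not an official chat template shipped with the model.

```python
def format_hh_prompt(turns):
    """Format a conversation in the Anthropic hh-rlhf plain-text style.

    `turns` alternates user / assistant utterances, starting with the
    user; the returned string ends with an open 'Assistant:' cue so the
    model continues with its response.
    """
    prompt = ""
    for i, turn in enumerate(turns):
        role = "Human" if i % 2 == 0 else "Assistant"
        prompt += f"\n\n{role}: {turn}"
    return prompt + "\n\nAssistant:"

prompt = format_hh_prompt(["How do I brew pour-over coffee?"])
```

Keeping prompts close to the fine-tuning format tends to elicit the helpfulness behavior the DPO training targeted.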