jackf857/llama-3-8b-base-new-dpo-hh-helpful-s_star1.0-4xh200-batch-64-20260421-233802

Text Generation · Concurrency Cost: 1 · Model Size: 8B · Quant: FP8 · Ctx Length: 8K · Published: Apr 22, 2026 · Architecture: Transformer

jackf857/llama-3-8b-base-new-dpo-hh-helpful-s_star1.0-4xh200-batch-64-20260421-233802 is an 8-billion-parameter language model fine-tuned from W-61/llama-3-8b-base-sft-hh-helpful-4xh200 using Direct Preference Optimization (DPO) on the Anthropic/hh-rlhf dataset. The model is tuned to produce helpful and harmless responses, with DPO training used to improve alignment over the SFT base. It is suited to conversational applications where both safety and utility matter.


Model Overview

This model, jackf857/llama-3-8b-base-new-dpo-hh-helpful-s_star1.0-4xh200-batch-64-20260421-233802, is an 8-billion-parameter language model based on the Llama 3 architecture. It was fine-tuned with Direct Preference Optimization (DPO) on the Anthropic/hh-rlhf dataset, a corpus of human preference comparisons collected to improve helpfulness and harmlessness in model responses. The starting point for this fine-tuning was W-61/llama-3-8b-base-sft-hh-helpful-4xh200.
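To make the DPO objective concrete, here is a minimal sketch of the per-pair loss: the policy is pushed to assign a larger log-probability margin to the chosen response than the frozen reference model does. This is a generic illustration of standard DPO, not code extracted from this model's training run; the function name and `beta` default are illustrative.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair.

    Each argument is the summed log-probability of the chosen or
    rejected response under the policy or the frozen reference model.
    """
    # Implicit reward margin: how much more the policy prefers the
    # chosen over the rejected response, relative to the reference.
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    # -log sigmoid(beta * margin): the loss shrinks as the margin grows.
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

At a zero margin the loss is log 2; as the policy's preference for the chosen response grows relative to the reference, the loss falls toward zero. The large reported margin value suggests the policy separated chosen and rejected responses strongly by the end of training.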

Key Training Details

Training used a learning rate of 5e-07, an effective batch size of 64 (4 GPUs with gradient accumulation), and a cosine learning-rate scheduler with a 0.1 warmup ratio over 1 epoch. The final validation loss was 0.4980, and DPO-related metrics improved substantially during training, with the logged reward margin reaching 120.6791.
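The reported schedule (peak LR 5e-07, warmup ratio 0.1, cosine decay over 1 epoch) can be sketched as a simple step-to-LR function. This is an assumed reconstruction of the common linear-warmup-plus-cosine-decay shape, not the exact scheduler code used for this run.

```python
import math

def cosine_lr(step, total_steps, peak_lr=5e-07, warmup_ratio=0.1):
    """Learning rate at a given optimizer step under linear warmup
    followed by cosine decay to zero."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear warmup from 0 up to the peak learning rate.
        return peak_lr * step / max(1, warmup_steps)
    # Cosine decay from the peak down to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

The LR rises linearly for the first 10% of steps, peaks at 5e-07, and then follows a half-cosine down to zero at the end of the epoch.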

Intended Use Cases

This model is particularly well-suited for applications where generating aligned, helpful, and harmless text is critical. Its DPO fine-tuning on the Anthropic/hh-rlhf dataset makes it a strong candidate for:

  • Conversational AI: Building chatbots or virtual assistants that prioritize safe and useful interactions.
  • Content Generation: Creating text that adheres to ethical guidelines and provides constructive information.
  • Response Filtering: Serving as a component in systems that aim to reduce harmful or unhelpful outputs from other language models.
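For the conversational use case, prompts are typically formatted the way the training data was. The Anthropic/hh-rlhf dataset uses alternating "\n\nHuman:" / "\n\nAssistant:" turns; a minimal helper, assuming this model expects that same format (the helper name and turn representation are illustrative):

```python
def build_hh_prompt(turns):
    """Format a dialogue in the "\n\nHuman: ... \n\nAssistant: ..."
    style used by the Anthropic/hh-rlhf dataset, ending with an open
    Assistant turn for the model to complete.

    `turns` is a list of (role, text) pairs, with role either
    "Human" or "Assistant".
    """
    parts = []
    for role, text in turns:
        parts.append(f"\n\n{role}: {text}")
    # Leave the final Assistant turn open for generation.
    parts.append("\n\nAssistant:")
    return "".join(parts)
```

A single-turn prompt then looks like `"\n\nHuman: How do I boil an egg?\n\nAssistant:"`, which the model completes as the assistant.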