jackf857/llama-3-8b-base-new-dpo-hh-helpful-s_star0.4-4xh200-batch-64-20260421-214335-rerun

Text generation · Concurrency cost: 1 · Model size: 8B · Quantization: FP8 · Context length: 8k · Published: Apr 21, 2026 · Architecture: Transformer

This model is a fine-tuned 8 billion parameter Llama 3 base model, developed by jackf857 and optimized for helpfulness through Direct Preference Optimization (DPO) on the Anthropic/hh-rlhf dataset. It supports an 8192-token context length and is derived from W-61/llama-3-8b-base-sft-hh-helpful-4xh200. The fine-tuning aims to improve the quality and helpfulness of generated responses, making the model suitable for conversational-AI applications that require consistently helpful output.


Overview

This model, developed by jackf857, is an 8 billion parameter Llama 3 base model that has undergone fine-tuning using Direct Preference Optimization (DPO). It is based on the W-61/llama-3-8b-base-sft-hh-helpful-4xh200 model and was specifically trained on the Anthropic/hh-rlhf dataset to improve its helpfulness.

Key Characteristics

  • Base Model: Llama 3 8B parameters.
  • Fine-tuning Method: Direct Preference Optimization (DPO).
  • Training Data: Anthropic/hh-rlhf dataset, focusing on helpfulness.
  • Context Length: Supports an 8192-token context window.
  • Performance: Achieved a final loss of 0.6074 on the evaluation set, with DPO reward margins shifting toward the preferred (chosen) responses over training.
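To make the DPO metrics above concrete, here is a minimal sketch of the per-pair DPO loss and reward margin in plain Python. The inputs, the `beta=0.1` default, and the example log-probabilities are illustrative assumptions, not values taken from this model's training run; note that when the policy equals the reference the loss starts at log 2 ≈ 0.693, so the reported eval loss of 0.6074 sits just below that starting point.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair.

    Inputs are summed log-probabilities of the chosen and rejected
    responses under the policy and the frozen reference model.
    beta=0.1 is a common default, not confirmed by this model card.
    """
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward  # the "margin" metric
    # loss = -log(sigmoid(margin)), written in a numerically stable form
    if margin > 0:
        loss = math.log1p(math.exp(-margin))
    else:
        loss = -margin + math.log1p(math.exp(margin))
    return loss, margin

# A policy that prefers the chosen response more strongly than the
# reference yields a positive margin and a loss below log(2).
loss, margin = dpo_loss(-40.0, -60.0, -45.0, -55.0, beta=0.1)
```

Minimizing this loss pushes the margin up, which is why the margin and chosen-response log-probabilities are the natural quantities to track during training.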

Training Details

The model was trained with a learning rate of 5e-07, a total batch size of 64, for 1 epoch, using a cosine learning rate scheduler with a 0.1 warmup ratio. Evaluation metrics tracked during training included DPO-specific scores, such as the reward margin between chosen and rejected responses and the log-probability of chosen responses (Logps/chosen), reflecting the model's progressive alignment with helpfulness preferences.
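The schedule described above can be sketched as follows. This mirrors the common "linear warmup then cosine decay" shape with the reported hyperparameters (base LR 5e-7, warmup ratio 0.1); the total step count and the exact trainer implementation are assumptions for illustration.

```python
import math

def lr_at_step(step, total_steps, base_lr=5e-7, warmup_ratio=0.1):
    """Cosine learning-rate schedule with linear warmup.

    base_lr and warmup_ratio match the values reported in the
    Training Details section; total_steps is hypothetical.
    """
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear warmup from 0 to base_lr over the first 10% of steps
        return base_lr * step / max(1, warmup_steps)
    # Cosine decay from base_lr down to 0 over the remaining steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# With e.g. 1000 optimizer steps in the single epoch, the LR ramps up to
# 5e-7 by step 100, then decays smoothly toward 0 at the final step.
peak = lr_at_step(100, 1000)
final = lr_at_step(1000, 1000)
```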

Intended Use Cases

This model is particularly well-suited for applications where generating helpful, aligned, and preference-optimized responses is critical. Its fine-tuning on the Anthropic/hh-rlhf dataset makes it a natural fit for conversational agents and assistants designed to provide helpful information and interactions.
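Since the model was tuned on hh-rlhf-style dialogues, prompts formatted in that dataset's plain-text convention (alternating `\n\nHuman:` / `\n\nAssistant:` turns, ending with an open `Assistant:` cue) are a reasonable starting point. The helper below is a sketch of that convention, not an official chat template shipped with the model.

```python
def format_hh_prompt(turns):
    """Format a conversation in the Anthropic hh-rlhf plain-text style.

    `turns` alternates user / assistant utterances, starting with the
    user; the returned string ends with an open 'Assistant:' cue so the
    model continues with its response.
    """
    prompt = ""
    for i, turn in enumerate(turns):
        role = "Human" if i % 2 == 0 else "Assistant"
        prompt += f"\n\n{role}: {turn}"
    return prompt + "\n\nAssistant:"

prompt = format_hh_prompt(["How do I brew pour-over coffee?"])
```

Keeping prompts close to the fine-tuning format tends to elicit the helpfulness behavior the DPO training targeted.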