jackf857/llama-3-8b-base-margin-dpo-hh-helpful-batch-64

Text Generation · Concurrency Cost: 1 · Model Size: 8B · Quant: FP8 · Ctx Length: 8k · Published: Apr 17, 2026 · Architecture: Transformer

The jackf857/llama-3-8b-base-margin-dpo-hh-helpful-batch-64 model is an 8 billion parameter Llama 3 base model fine-tuned using Margin DPO on the Anthropic/hh-rlhf dataset. This model is optimized for helpfulness, building upon a previously SFT-tuned Llama 3 variant. It is designed for tasks requiring helpful and aligned responses, leveraging its 8192 token context length.


Model Overview

This model, jackf857/llama-3-8b-base-margin-dpo-hh-helpful-batch-64, is an 8 billion parameter Llama 3 base model. It was fine-tuned with Margin DPO (a margin-based variant of Direct Preference Optimization), starting from a Llama 3 base model that had already undergone Supervised Fine-Tuning (SFT). Training used the Anthropic/hh-rlhf dataset, a preference dataset widely used for aligning models with human judgments of helpfulness.

Key Training Details

  • Base Model: W-61/llama-3-8b-base-sft-hh-helpful-4xh200
  • Fine-tuning Method: Margin DPO
  • Dataset: Anthropic/hh-rlhf
  • Training Hyperparameters:
    • Learning Rate: 5e-07
    • Total Train Batch Size: 64
    • Number of Epochs: 1
  • Evaluation Metrics: final training loss of 0.4046, with a mean preference margin (loss_margin_mean) of 21.7563, indicating that the model learned to clearly separate chosen from rejected responses.
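To make the reported metrics concrete, the sketch below shows one common way a margin term enters the DPO objective: the loss is the negative log-sigmoid of the scaled preference gap, with a fixed margin subtracted from that gap. The exact margin placement and the `beta`/`margin` values are assumptions for illustration, not details taken from this model's training config.

```python
import math

def margin_dpo_loss(chosen_logratio, rejected_logratio, beta=0.1, margin=0.0):
    """Per-example DPO loss with an explicit margin (illustrative sketch).

    chosen_logratio   = log pi_theta(y_w|x) - log pi_ref(y_w|x)
    rejected_logratio = log pi_theta(y_l|x) - log pi_ref(y_l|x)
    """
    gap = beta * (chosen_logratio - rejected_logratio) - margin
    # -log sigmoid(gap) == softplus(-gap); guard against overflow in exp
    return math.log1p(math.exp(-gap)) if gap > -30 else -gap

def preference_margin(chosen_logratio, rejected_logratio):
    """The quantity tracked as the 'loss margin' metric: the gap between
    chosen and rejected log-ratios (its running mean is reported above)."""
    return chosen_logratio - rejected_logratio
```

A larger margin forces the model to prefer the chosen response by more than the rejected one before the loss saturates, which is consistent with the large positive mean margin reported for this run.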

Intended Use Cases

This model is suited to applications where generating helpful, preference-aligned text is crucial. As the model name suggests, the fine-tuning targeted the helpfulness portion of the Anthropic/hh-rlhf dataset, emphasizing responses that human evaluators rate as helpful. Developers can use this model for tasks requiring robust, preference-aligned language generation within its 8192-token context window.