jackf857/llama-3-8b-base-margin-dpo-hh-4xh100

Text Generation · Concurrency Cost: 1 · Model Size: 8B · Quant: FP8 · Ctx Length: 8k · Published: Apr 5, 2026 · License: llama3 · Architecture: Transformer

The jackf857/llama-3-8b-base-margin-dpo-hh-4xh100 model is an 8 billion parameter Llama 3 base model, fine-tuned from W-61/llama-3-8b-base-hh-harmless-sft-4xh100. It was trained with Direct Preference Optimization (DPO) on the Anthropic/hh-rlhf dataset to align it with human preferences for helpfulness and harmlessness. This model is designed for applications requiring a robust, preference-aligned language model with an 8192 token context length.


Model Overview

The jackf857/llama-3-8b-base-margin-dpo-hh-4xh100 is an 8 billion parameter language model based on the Llama 3 architecture. It is a fine-tuned variant of the W-61/llama-3-8b-base-hh-harmless-sft-4xh100 model, specifically optimized using Direct Preference Optimization (DPO).
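For context, DPO fine-tunes the policy directly on preference pairs, with no separate reward model, by pushing the policy's log-likelihood ratio (measured against a frozen reference, here the SFT checkpoint) higher for chosen responses than for rejected ones. The standard objective is:

$$
\mathcal{L}_{\text{DPO}} = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right) \right]
$$

The "margin" in the model name suggests a margin-augmented variant, in which a fixed offset $\gamma$ is subtracted inside the sigmoid to enforce a minimum preference gap; the card itself does not describe this, so treat it as an inference from the name alone.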

Key Training Details

This model was trained for a single epoch on the Anthropic/hh-rlhf dataset, which pairs human preference judgments with model responses to capture helpfulness and harmlessness. Training used a learning rate of 5e-07 with a cosine learning rate scheduler and a 0.1 warmup ratio. The per-device batch size was 4 across 4 GPUs with 8 gradient accumulation steps, for an effective batch size of 4 × 4 × 8 = 128.
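The card does not name the training framework, but these hyperparameters map naturally onto TRL's DPOConfig. Below is a minimal sketch, assuming TRL and Hugging Face datasets; the beta value is an assumption, as the card does not give it:

```python
# Sketch of a DPO run matching the card's hyperparameters.
# Assumes TRL as the training framework; beta=0.1 is a guess (not in the card).
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base = "W-61/llama-3-8b-base-hh-harmless-sft-4xh100"  # SFT checkpoint named in the card
model = AutoModelForCausalLM.from_pretrained(base)
tokenizer = AutoTokenizer.from_pretrained(base)

config = DPOConfig(
    output_dir="llama-3-8b-base-margin-dpo-hh",
    num_train_epochs=1,                 # single epoch, per the card
    learning_rate=5e-7,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    per_device_train_batch_size=4,      # 4 per GPU x 4 GPUs
    gradient_accumulation_steps=8,      # 4 x 4 x 8 = 128 effective batch
    beta=0.1,                           # assumption: DPO beta is not stated in the card
)

# Note: hh-rlhf records may first need mapping into TRL's
# prompt/chosen/rejected schema, depending on the TRL version.
train_dataset = load_dataset("Anthropic/hh-rlhf", split="train")

trainer = DPOTrainer(
    model=model,
    args=config,
    train_dataset=train_dataset,
    processing_class=tokenizer,         # `tokenizer=` in older TRL versions
)
trainer.train()
```

Launched under `accelerate launch` (or `torchrun`) across 4 GPUs, this reproduces the 128-sample effective batch described above.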

Intended Use Cases

Given its DPO fine-tuning on a human preference dataset, this model is likely suitable for applications where alignment with human values, particularly generating helpful and harmless responses, is critical. Developers can leverage it for tasks requiring a preference-aligned Llama 3 variant, as in the sketch below.
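Since this is a base-architecture model fine-tuned on hh-rlhf dialogues rather than an instruction-tuned chat model with a documented template, here is a minimal generation sketch using Hugging Face transformers; the Human/Assistant prompt format mirrors the hh-rlhf convention and is an assumption, not something the card specifies:

```python
# Minimal inference sketch using Hugging Face transformers.
# The Human/Assistant prompt format follows hh-rlhf convention (an assumption).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "jackf857/llama-3-8b-base-margin-dpo-hh-4xh100"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

prompt = "\n\nHuman: How do I write a polite follow-up email?\n\nAssistant:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```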