W-61/llama-3-8b-base-new-dpo-hh-helpful-4xh200-batch-64-s_star-0.4-eta-0.1-q_t-0.5

Text generation · Concurrency cost: 1 · Model size: 8B · Quantization: FP8 · Context length: 8k · Published: Apr 28, 2026 · Architecture: Transformer

W-61/llama-3-8b-base-new-dpo-hh-helpful-4xh200-batch-64-s_star-0.4-eta-0.1-q_t-0.5 is an 8 billion parameter Llama 3 base model fine-tuned by W-61. This model is specifically optimized using Direct Preference Optimization (DPO) on the Anthropic/hh-rlhf dataset, enhancing its helpfulness and alignment. With an 8192-token context length, it is designed for generating helpful and aligned responses in conversational AI applications.


Model Overview

This model, llama-3-8b-base-new-dpo-hh-helpful-4xh200-batch-64-s_star-0.4-eta-0.1-q_t-0.5, is an 8 billion parameter variant of the Llama 3 architecture. Developed by W-61, it is a fine-tuned iteration of the W-61/llama-3-8b-base-sft-hh-helpful-4xh200 model.

Key Capabilities

  • Preference Alignment: The model has undergone Direct Preference Optimization (DPO) using the Anthropic/hh-rlhf dataset. This training method aims to align the model's outputs more closely with human preferences, particularly for helpfulness.
  • Base Model Enhancement: It builds upon a supervised fine-tuned (SFT) Llama 3 base, further refining its response generation capabilities.
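To make the DPO objective concrete, here is a minimal sketch of the standard DPO loss for a single preference pair. The beta value is illustrative only; the s_star, eta, and q_t values in the model name suggest a modified DPO variant whose exact formulation is not documented here.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) response pair.

    Each argument is the summed token log-probability of the chosen or
    rejected response under the policy or the frozen reference model.
    """
    # Log-ratio of policy to reference for each response
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    # -log sigmoid(beta * margin): the loss shrinks as the policy
    # prefers the chosen response more strongly than the reference does
    margin = beta * (chosen_logratio - rejected_logratio)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# When policy and reference agree exactly, the margin is 0 and the
# loss equals -log(0.5) = ln 2 ≈ 0.6931
print(round(dpo_loss(-10.0, -12.0, -10.0, -12.0), 4))
```

Training then reduces this loss over the hh-rlhf preference pairs, pushing the policy's chosen/rejected log-ratio gap above the reference model's.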

Training Details

The fine-tuning process involved specific hyperparameters:

  • Learning Rate: 5e-07
  • Batch Size: A per-device train_batch_size of 8, with a total_train_batch_size of 64 reached via multi-GPU data parallelism and gradient accumulation.
  • Optimizer: AdamW with default betas and epsilon.
  • Scheduler: Cosine learning rate scheduler with a 0.1 warmup ratio.
  • Epochs: Trained for 1 epoch.
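The batch arithmetic and learning-rate schedule above can be sketched as follows. The 4 GPUs and gradient-accumulation factor of 2 are inferred from the 4xh200 and batch-64 parts of the model name, not stated explicitly in the training config:

```python
import math

def effective_batch_size(per_device_batch, num_gpus, grad_accum_steps):
    """Total examples contributing to one optimizer step."""
    return per_device_batch * num_gpus * grad_accum_steps

def cosine_lr_with_warmup(step, total_steps, peak_lr=5e-7, warmup_ratio=0.1):
    """Linear warmup over the first warmup_ratio of training, then
    cosine decay to zero, matching the hyperparameters listed above."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(warmup_steps, 1)
    progress = (step - warmup_steps) / max(total_steps - warmup_steps, 1)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

print(effective_batch_size(8, 4, 2))  # → 64
```

The learning rate peaks at 5e-07 when warmup ends and decays to zero by the final step of the single epoch.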

Intended Use Cases

Given its DPO fine-tuning on a helpfulness dataset, this model is well-suited for applications requiring:

  • Generating helpful and aligned text.
  • Conversational AI where user preference and safety are important.
  • Tasks benefiting from models trained to reduce harmful or unhelpful outputs.
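Because the preference data comes from Anthropic/hh-rlhf, prompts formatted in that dataset's alternating Human/Assistant turn style are a natural fit. A minimal formatting sketch follows; the exact template this checkpoint expects is an assumption, not documented on this card:

```python
def format_hh_prompt(turns):
    """Render (role, text) turns in the hh-rlhf style: alternating
    "\\n\\nHuman:" and "\\n\\nAssistant:" markers, ending with an open
    Assistant turn for the model to complete."""
    parts = []
    for role, text in turns:
        marker = "Human" if role == "user" else "Assistant"
        parts.append(f"\n\n{marker}: {text}")
    parts.append("\n\nAssistant:")
    return "".join(parts)

prompt = format_hh_prompt([("user", "How do I boil an egg?")])
print(repr(prompt))
```

The resulting string can be passed to any standard text-generation pipeline; stopping generation at the next "\n\nHuman:" marker keeps the model to a single assistant turn.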