W-61/llama-3-8b-base-new-dpo-hh-helpful-4xh200-batch-64-q_t-0.45-s_star-0.4-eta-8

TEXT GENERATION

Concurrency Cost: 1 · Model Size: 8B · Quant: FP8 · Ctx Length: 8k · Published: Apr 28, 2026 · Architecture: Transformer

W-61/llama-3-8b-base-new-dpo-hh-helpful-4xh200-batch-64-q_t-0.45-s_star-0.4-eta-8 is an 8 billion parameter language model fine-tuned by W-61, based on the Llama 3 architecture. It was fine-tuned with Direct Preference Optimization (DPO) on the Anthropic/hh-rlhf dataset to enhance helpfulness and align outputs with human preferences. With a context length of 8192 tokens, it is designed for conversational AI applications that require helpful, aligned responses.


Model Overview

This model, llama-3-8b-base-new-dpo-hh-helpful-4xh200-batch-64-q_t-0.45-s_star-0.4-eta-8, is an 8 billion parameter language model developed by W-61. It is a fine-tuned variant of the W-61/llama-3-8b-base-sft-hh-helpful-4xh200 base model.

Key Characteristics

  • Base Model: Llama 3 8B architecture.
  • Fine-tuning Method: Utilizes Direct Preference Optimization (DPO).
  • Training Data: Fine-tuned on the Anthropic/hh-rlhf dataset, which is designed to improve helpfulness and reduce harmfulness through human feedback.
  • Context Length: Supports a context window of 8192 tokens.
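DPO optimizes the policy directly on preference pairs, without training a separate reward model: it pushes up the log-probability of the chosen response relative to a frozen reference model, and pushes down the rejected one. A minimal sketch of the per-pair DPO loss (this function and its toy inputs are illustrative, not taken from W-61's training code):

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Per-pair DPO loss from sequence log-probabilities.

    pi_*  : log-prob of the response under the policy being trained
    ref_* : log-prob of the same response under the frozen reference (SFT) model
    beta  : temperature on the implicit reward (0.1 is a common default,
            not the confirmed value for this model)
    """
    # Implicit reward margin: how much more the policy prefers the chosen
    # response over the rejected one, relative to the reference model.
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    # -log sigmoid(beta * margin): small when the margin is large and positive.
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# A positive margin (policy already prefers the chosen response) gives a
# loss below log(2); a zero margin gives exactly log(2).
print(dpo_loss(-1.0, -2.0, -1.5, -1.5))
```

Training then averages this loss over a batch of (prompt, chosen, rejected) triples from hh-rlhf and backpropagates through the policy's log-probabilities only.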

Training Details

The model was trained for a single epoch with a learning rate of 5e-07 and a total batch size of 64 across 4 GPUs (H200s, per the model name). The optimizer was AdamW with a cosine learning-rate schedule and a warmup ratio of 0.1.
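The stated schedule (linear warmup over the first 10% of steps, then cosine decay from the 5e-07 peak) can be sketched as follows; the function name and step-based formulation are illustrative, not lifted from the training code:

```python
import math

def lr_at_step(step, total_steps, base_lr=5e-7, warmup_ratio=0.1):
    """Learning rate at a given optimizer step.

    Linear warmup from 0 to base_lr over warmup_ratio of training,
    then cosine decay from base_lr down to 0.
    """
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear warmup: ramps from 0 at step 0 to base_lr at warmup_steps.
        return base_lr * step / max(1, warmup_steps)
    # Cosine decay over the remaining steps: base_lr at progress=0, 0 at progress=1.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# Peak LR is reached exactly at the end of warmup.
print(lr_at_step(100, 1000))  # 5e-07
```

With a total batch size of 64 on 4 GPUs, each GPU would process 16 examples per step (possibly via gradient accumulation, which the card does not specify).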

Intended Use Cases

This model is primarily intended for applications where generating helpful, harmless, and aligned responses is crucial. Its DPO fine-tuning on a human feedback dataset suggests suitability for:

  • Chatbots and conversational AI systems.
  • Assistants requiring helpful and preference-aligned outputs.
  • Tasks benefiting from improved instruction following and safety.
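For chatbot deployments, the 8192-token context window bounds how much conversation history can be sent per request, so older turns must be dropped once the budget is exceeded. A minimal sketch of that truncation policy (the helper name and the whitespace-based token estimate are illustrative; a real deployment would count tokens with the model's own tokenizer):

```python
def fit_context(turns, max_tokens=8192, count=lambda s: len(s.split())):
    """Keep the most recent conversation turns that fit the token budget.

    turns      : list of turn strings, oldest first
    max_tokens : context budget (8192 for this model, minus room for the reply)
    count      : token estimator; whitespace split is a rough stand-in
    """
    kept, total = [], 0
    # Walk backwards from the newest turn, keeping turns until the budget fills.
    for turn in reversed(turns):
        n = count(turn)
        if total + n > max_tokens:
            break
        kept.append(turn)
        total += n
    # Restore chronological order for the prompt.
    return list(reversed(kept))

history = ["first question", "a long detailed answer", "follow-up"]
print(fit_context(history, max_tokens=4))
```

In practice the budget passed in should also reserve headroom for the generated reply, since prompt and completion share the same 8192-token window.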