W-61/llama-3-8b-base-new-dpo-hh-helpful-4xh200-batch-64-s_star-0.4-eta-0.1-q_t-0.48

Text Generation · Concurrency Cost: 1 · Model Size: 8B · Quant: FP8 · Ctx Length: 8k · Published: Apr 28, 2026 · Architecture: Transformer

W-61/llama-3-8b-base-new-dpo-hh-helpful-4xh200-batch-64-s_star-0.4-eta-0.1-q_t-0.48 is an 8-billion-parameter language model published by W-61. Starting from the Llama 3 8B base model, it was fine-tuned with Direct Preference Optimization (DPO) on the Anthropic/hh-rlhf dataset, building on a prior Supervised Fine-Tuning (SFT) checkpoint. The model supports a context length of 8192 tokens and is tuned to generate helpful, preference-aligned responses.


Model Overview

W-61/llama-3-8b-base-new-dpo-hh-helpful-4xh200-batch-64-s_star-0.4-eta-0.1-q_t-0.48 is an 8-billion-parameter language model developed by W-61. It is a fine-tuned variant of Llama 3 8B, building directly on the earlier supervised fine-tuned checkpoint, W-61/llama-3-8b-base-sft-hh-helpful-4xh200.

Key Capabilities

  • Preference Alignment: This model has undergone an additional Direct Preference Optimization (DPO) phase using the Anthropic/hh-rlhf dataset. This training aims to align the model's outputs more closely with human preferences for helpfulness.
  • Enhanced Helpfulness: The DPO fine-tuning is specifically geared towards improving the model's ability to generate helpful and constructive responses.
  • Llama 3 Base: Benefits from the foundational capabilities and architecture of the Llama 3 8B base model.
  • Context Window: Supports a context length of 8192 tokens, allowing longer inputs and outputs to be handled in a single pass (see the loading sketch after this list).
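
As a concrete starting point, the snippet below sketches loading the model with Hugging Face Transformers and generating a response. The repo id is taken from the model name above; the dtype, device placement, and generation parameters are illustrative assumptions, not values published with the model.

```python
# Minimal inference sketch, assuming the model is hosted on the Hugging Face Hub
# under the repo id shown above. Generation settings are illustrative defaults.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "W-61/llama-3-8b-base-new-dpo-hh-helpful-4xh200-batch-64-s_star-0.4-eta-0.1-q_t-0.48"

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    torch_dtype=torch.bfloat16,  # assumption: bf16 weights on a single GPU
    device_map="auto",
)

# HH-style prompt format (see the formatting sketch under "Intended Use Cases").
prompt = "\n\nHuman: How do I brew a good cup of coffee?\n\nAssistant:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=256,          # well within the 8192-token context window
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    pad_token_id=tokenizer.eos_token_id,
)
# Decode only the newly generated tokens, not the prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```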

Training Details

The model was trained for 1 epoch with a learning rate of 5e-07 and a total batch size of 64, using a cosine learning rate scheduler with a warmup ratio of 0.1. Training used Transformers 4.51.0, PyTorch 2.3.1+cu121, Datasets 2.21.0, and Tokenizers 0.21.4.
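
For readers who want to reproduce a comparable setup, the sketch below wires the stated hyperparameters into trl's DPOTrainer. The DPO beta, the per-device batch split, and the prompt-extraction heuristic are assumptions (hh-rlhf ships only full chosen/rejected dialogues, and no beta is published for this model), and exact trl argument names vary by version.

```python
# Hypothetical reproduction sketch: DPO on Anthropic/hh-rlhf with the stated
# hyperparameters. beta and the prompt-splitting heuristic are assumptions.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

sft_repo = "W-61/llama-3-8b-base-sft-hh-helpful-4xh200"  # the SFT starting point named above
tokenizer = AutoTokenizer.from_pretrained(sft_repo)
model = AutoModelForCausalLM.from_pretrained(sft_repo)

def split_prompt(example):
    # hh-rlhf stores full dialogues; recover the shared prompt by cutting at the
    # final assistant turn (a common convention, not documented for this model).
    marker = "\n\nAssistant:"
    idx = example["chosen"].rfind(marker) + len(marker)
    return {
        "prompt": example["chosen"][:idx],
        "chosen": example["chosen"][idx:],
        "rejected": example["rejected"][example["rejected"].rfind(marker) + len(marker):],
    }

train_dataset = load_dataset("Anthropic/hh-rlhf", split="train").map(split_prompt)

config = DPOConfig(
    output_dir="llama-3-8b-dpo-hh-helpful",
    learning_rate=5e-7,              # stated learning rate
    num_train_epochs=1,              # stated epoch count
    lr_scheduler_type="cosine",      # stated scheduler
    warmup_ratio=0.1,                # stated warmup ratio
    per_device_train_batch_size=4,   # assumption: 4 x H200 x grad accum 4 = total batch 64
    gradient_accumulation_steps=4,
    beta=0.1,                        # assumption: a typical DPO beta, not published here
    bf16=True,
)

# With ref_model omitted, trl clones the initial policy as the frozen reference.
trainer = DPOTrainer(model=model, args=config, train_dataset=train_dataset, processing_class=tokenizer)
trainer.train()
```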

Intended Use Cases

This model is particularly suitable for applications requiring:

  • Generating helpful and aligned text.
  • Tasks where human preference for response quality is critical.
  • Building conversational agents or assistants that prioritize helpfulness (a prompt-formatting sketch follows this list).
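
Since the model descends from a Llama 3 base model rather than an instruct variant, it most likely expects the hh-rlhf dialogue format instead of a chat template. The helper below illustrates one way to render a multi-turn conversation in that format; the format is inferred from the training data, not confirmed by this model card.

```python
# Hypothetical helper: render a conversation in the hh-rlhf "Human/Assistant"
# format used by the training data. The format is inferred, not documented here.
def format_hh_dialogue(turns: list[tuple[str, str]], user_message: str) -> str:
    """turns is a list of (human, assistant) pairs from earlier in the chat."""
    prompt = ""
    for human, assistant in turns:
        prompt += f"\n\nHuman: {human}\n\nAssistant: {assistant}"
    prompt += f"\n\nHuman: {user_message}\n\nAssistant:"
    return prompt

history = [("What's a good beginner houseplant?", "A pothos is hardy and tolerates low light.")]
print(format_hh_dialogue(history, "How often should I water it?"))
```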