W-61/llama-3-8b-base-new-dpo-hh-helpful-4xh200-batch-64-q_t-0.45-s_star-0.4-eta-0.3

Text Generation · Concurrency Cost: 1 · Model Size: 8B · Quant: FP8 · Ctx Length: 8k · Published: Apr 28, 2026 · Architecture: Transformer

W-61/llama-3-8b-base-new-dpo-hh-helpful-4xh200-batch-64-q_t-0.45-s_star-0.4-eta-0.3 is an 8-billion-parameter language model fine-tuned by W-61. It is a DPO-tuned variant of the Llama 3 8B base model, optimized on the Anthropic/hh-rlhf dataset and built on a previously instruction-tuned (SFT) version. The model is designed for helpful conversational tasks and supports a context length of 8192 tokens, making it suitable for longer multi-turn conversations.


Model Overview

This model, W-61/llama-3-8b-base-new-dpo-hh-helpful-4xh200-batch-64-q_t-0.45-s_star-0.4-eta-0.3, is an 8 billion parameter language model developed by W-61. It is a fine-tuned version of W-61/llama-3-8b-base-sft-hh-helpful-4xh200, which itself was a supervised fine-tuned (SFT) variant of the Llama 3 8B base model.

Key Capabilities

  • DPO Fine-tuning: The model has undergone Direct Preference Optimization (DPO) using the Anthropic/hh-rlhf dataset. This process aims to align the model's outputs more closely with human preferences for helpfulness.
  • Helpful Responses: Building on its SFT predecessor, this DPO-tuned model is specifically geared towards generating helpful and aligned conversational outputs.
  • Llama 3 Architecture: Inherits the foundational capabilities and performance characteristics of the Llama 3 8B base model.
  • Context Length: Supports a context window of 8192 tokens, allowing for processing and generating longer sequences of text.
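Since both the SFT and DPO stages used the Anthropic/hh-rlhf dataset, prompts in the "\n\nHuman: … \n\nAssistant:" layout of that dataset are a reasonable starting point. The exact template used during training is not stated on this card, so the helper below (`format_hh_prompt` is a hypothetical name) is a sketch under that assumption:

```python
def format_hh_prompt(turns):
    """Format alternating human/assistant turns into the
    "\n\nHuman: ...\n\nAssistant: ..." layout used by the
    Anthropic/hh-rlhf dataset. The returned string ends with
    "\n\nAssistant:" to cue the model's next reply.
    NOTE: assumed prompt format; the card does not specify one."""
    roles = ["Human", "Assistant"]
    parts = [f"\n\n{roles[i % 2]}: {t}" for i, t in enumerate(turns)]
    return "".join(parts) + "\n\nAssistant:"
```

The resulting string can then be passed to a standard `transformers` text-generation pipeline; keep the combined prompt and completion within the 8192-token context window.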

Training Details

The model was trained for 1 epoch with a learning rate of 5e-07 and a total batch size of 64 across 4 GPUs, using the AdamW optimizer with cosine learning-rate scheduling and a 0.1 warmup ratio. This training regimen was designed to enhance the model's ability to provide helpful, preference-aligned responses.
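The reported hyperparameters can be expressed as a TRL `DPOConfig`, shown as a sketch below. Only the learning rate, epoch count, total batch size, optimizer, scheduler, and warmup ratio come from this card; the per-device split and everything else (including the DPO beta, and the `q_t`/`s_star`/`eta` values in the model name, which are not explained here) are assumptions or left at defaults:

```python
from trl import DPOConfig

# Sketch of a DPO training config matching the reported hyperparameters.
config = DPOConfig(
    learning_rate=5e-7,              # reported learning rate
    num_train_epochs=1,              # reported: 1 epoch
    per_device_train_batch_size=16,  # assumed: 16 x 4 GPUs = total batch 64
    optim="adamw_torch",             # reported: AdamW
    lr_scheduler_type="cosine",      # reported: cosine schedule
    warmup_ratio=0.1,                # reported: 0.1 warmup ratio
    output_dir="llama-3-8b-dpo-hh-helpful",  # hypothetical output path
)
```

This config would be passed to a `DPOTrainer` together with the SFT checkpoint and a preference dataset such as Anthropic/hh-rlhf.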

Intended Use Cases

This model is particularly well-suited for applications requiring:

  • Helpful AI Assistants: Generating informative and user-friendly responses in conversational agents.
  • Preference-Aligned Generation: Tasks where the output needs to adhere to specific helpfulness criteria, as learned from the HH-RLHF dataset.
  • General Text Generation: Leveraging the Llama 3 8B base capabilities for a wide range of language understanding and generation tasks, with an emphasis on helpfulness.