W-61/llama-3-8b-base-new-dpo-hh-helpful-4xh200-batch-64-q_t-0.45-s_star-0.4-eta-5

Text generation · Model size: 8B · Quant: FP8 · Context length: 8k · Concurrency cost: 1 · Architecture: Transformer · Published: Apr 28, 2026

W-61/llama-3-8b-base-new-dpo-hh-helpful-4xh200-batch-64-q_t-0.45-s_star-0.4-eta-5 is an 8-billion-parameter Llama 3 base model fine-tuned by W-61 via Direct Preference Optimization (DPO) on the Anthropic/hh-rlhf dataset to improve helpfulness and alignment. With an 8192-token context window, it targets conversational AI and instruction-following tasks where helpful, aligned responses are critical.


Model Overview

This model, W-61/llama-3-8b-base-new-dpo-hh-helpful-4xh200-batch-64-q_t-0.45-s_star-0.4-eta-5, is an 8-billion-parameter variant of the Llama 3 architecture. It was fine-tuned with Direct Preference Optimization (DPO) on the Anthropic/hh-rlhf human-preference dataset, building on its base model, W-61/llama-3-8b-base-sft-hh-helpful-4xh200.

Key Characteristics

  • Architecture: Llama 3 8B base model.
  • Fine-tuning: Utilizes Direct Preference Optimization (DPO) for enhanced alignment and helpfulness.
  • Training Data: Fine-tuned on the Anthropic/hh-rlhf dataset, known for its focus on human feedback and helpfulness.
  • Context Length: Supports an 8192-token context window.
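To make the DPO fine-tuning step concrete, here is a minimal sketch of the per-example DPO objective: the loss is `-log σ(β Δ)`, where `Δ` is the policy's log-probability margin between the chosen and rejected responses, measured relative to a frozen reference model. The `beta` value below is illustrative only; the actual coefficient used for this model is not stated on the card.

```python
import math

def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """Per-example DPO loss from summed token log-probs.

    beta (illustrative, not from the model card) scales the implicit
    reward margin between the chosen and rejected completions.
    """
    margin = ((policy_chosen_lp - ref_chosen_lp)
              - (policy_rejected_lp - ref_rejected_lp))
    # -log sigmoid(beta * margin), written in a numerically plain form
    return math.log(1 + math.exp(-beta * margin))
```

When the policy and reference assign identical log-probabilities, the loss sits at `log 2`; it drops below that as soon as the policy prefers the chosen response more strongly than the reference does, which is exactly the direction DPO training pushes.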

Training Details

The model was trained for a single epoch with a learning rate of 5e-07, a total batch size of 64, and a cosine learning-rate scheduler with a 0.1 warmup ratio, on a multi-GPU setup with 4 devices.
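The schedule described above can be sketched as follows: linear warmup over the first 10% of steps up to the peak rate of 5e-07, then cosine decay to zero. The step counts are illustrative assumptions; the card does not state the total number of optimizer steps.

```python
import math

def lr_at_step(step, total_steps, peak_lr=5e-07, warmup_ratio=0.1):
    """Cosine learning-rate schedule with linear warmup.

    peak_lr and warmup_ratio match the model card; total_steps is an
    assumption for illustration.
    """
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 to peak_lr
        return peak_lr * step / max(1, warmup_steps)
    # Cosine decay from peak_lr down to 0 over the remaining steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1 + math.cos(math.pi * progress))
```

With a warmup ratio of 0.1, the rate peaks one tenth of the way through the epoch and reaches zero at the final step.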

Intended Use Cases

This model is particularly well-suited for applications requiring highly aligned and helpful responses, such as:

  • Conversational AI: Building chatbots or virtual assistants that provide useful and safe interactions.
  • Instruction Following: Executing complex instructions with a focus on generating helpful outputs.
  • Content Generation: Creating text that adheres to specific helpfulness and safety guidelines.