W-61/llama-3-8b-base-new-dpo-hh-helpful-4xh200-batch-64-q_t-0.45-eta-0.1-s_star-0.8-20260428-045924

Text Generation · Concurrency Cost: 1 · Model Size: 8B · Quant: FP8 · Context Length: 8k · Published: Apr 28, 2026 · Architecture: Transformer

W-61/llama-3-8b-base-new-dpo-hh-helpful-4xh200-batch-64-q_t-0.45-eta-0.1-s_star-0.8-20260428-045924 is an 8-billion-parameter Llama 3-based language model fine-tuned by W-61. It was optimized with Direct Preference Optimization (DPO) on the Anthropic/hh-rlhf dataset to improve helpfulness and alignment. With an 8192-token context window, it is suited to conversational AI applications that require helpful, aligned responses.


Model Overview

This model, llama-3-8b-base-new-dpo-hh-helpful-4xh200-batch-64-q_t-0.45-eta-0.1-s_star-0.8-20260428-045924, is an 8 billion parameter language model developed by W-61. It is a fine-tuned variant of the W-61/llama-3-8b-base-sft-hh-helpful-4xh200 base model.

Key Characteristics

  • Base Model: Llama 3-8B architecture.
  • Fine-tuning: Direct Preference Optimization (DPO) on human preference data.
  • Training Data: Fine-tuned on the Anthropic/hh-rlhf dataset, which focuses on human feedback for helpfulness and harmlessness.
  • Context Length: Supports an 8192-token context window.

Training Details

The model was trained with the following key hyperparameters:

  • Learning Rate: 5e-07
  • Batch Size: A total training batch size of 64 (8 per device with 4 devices and 2 gradient accumulation steps).
  • Optimizer: AdamW (the adamw_torch implementation) with default betas and epsilon.
  • Scheduler: Cosine learning rate scheduler with a 0.1 warmup ratio.
  • Epochs: Trained for 1 epoch.
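The hyperparameters above can be sketched in a few lines of plain Python: the per-pair DPO objective, the warmup-then-cosine learning-rate schedule, and the effective batch size arithmetic. The beta value below is an assumed placeholder (the card does not state it), and the log-probability inputs are illustrative scalars, not real model outputs.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair: -log sigmoid(beta * margin),
    where the margin is the policy-vs-reference log-ratio gap between
    the chosen and rejected responses. beta=0.1 is an assumed value."""
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

def lr_at_step(step, total_steps, max_lr=5e-7, warmup_ratio=0.1):
    """Linear warmup over the first warmup_ratio of steps,
    then cosine decay from max_lr down to 0 (matching the card's
    5e-07 peak LR and 0.1 warmup ratio)."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return max_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return max_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# Effective batch size from the card:
# 8 per device * 4 devices * 2 gradient accumulation steps = 64
effective_batch = 8 * 4 * 2
```

With a zero margin the loss is log 2 (the model is indifferent between chosen and rejected); as the policy assigns relatively more probability to the chosen response, the loss falls toward zero.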

Intended Use Cases

Given its DPO fine-tuning on a helpfulness-focused dataset, this model is particularly suited for applications where generating helpful, aligned, and human-preferred responses is critical. This includes:

  • Conversational AI: Building chatbots or virtual assistants that provide useful and relevant information.
  • Content Generation: Creating text that adheres to helpfulness guidelines.
  • Instruction Following: Executing complex instructions with a focus on generating beneficial outputs.
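For conversational use, prompts are plausibly formatted in the Anthropic/hh-rlhf dialogue style ("\n\nHuman: ...\n\nAssistant: ..."), since that is the format of the preference data this model was tuned on. The helper below is a hypothetical sketch under that assumption; the card does not specify an inference prompt template.

```python
def build_hh_prompt(turns, user_msg):
    """Format a dialogue in the hh-rlhf style (assumed, not confirmed
    by the model card): prior (human, assistant) turns followed by the
    new user message, ending with an open 'Assistant:' for generation."""
    parts = []
    for human, assistant in turns:
        parts.append(f"\n\nHuman: {human}\n\nAssistant: {assistant}")
    parts.append(f"\n\nHuman: {user_msg}\n\nAssistant:")
    return "".join(parts)

# Example: a single-turn prompt for a chatbot built on this model
prompt = build_hh_prompt([], "How do I brew loose-leaf tea?")
```

The returned string would then be tokenized and passed to the model, keeping the total length within the 8192-token context window.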