W-61/llama-3-8b-base-new-dpo-hh-helpful-4xh200-batch-64-q_t-0.45-eta-0.1-s_star-0.45-20260428-045924

Text generation · Concurrency cost: 1 · Model size: 8B · Quantization: FP8 · Context length: 8k · Published: Apr 28, 2026 · Architecture: Transformer

W-61/llama-3-8b-base-new-dpo-hh-helpful-4xh200-batch-64-q_t-0.45-eta-0.1-s_star-0.45-20260428-045924 is an 8 billion parameter language model developed by W-61 and fine-tuned from W-61/llama-3-8b-base-sft-hh-helpful-4xh200. It was fine-tuned using Direct Preference Optimization (DPO) on the Anthropic/hh-rlhf dataset to enhance helpfulness. Building on the Llama 3 base architecture with an 8192-token context length, it is intended for applications that require helpful, aligned response generation.


Model Overview

This model, developed by W-61, is an 8 billion parameter language model based on the Llama 3 architecture. It is a fine-tuned variant of the W-61/llama-3-8b-base-sft-hh-helpful-4xh200 model, specifically optimized for helpfulness.
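For reference, below is a minimal loading sketch using Hugging Face transformers. It assumes the checkpoint is hosted under the repo id shown on this card and loads with the standard AutoModel classes; the dtype and device placement are illustrative choices, not settings prescribed by this card.

```python
# Minimal loading sketch; the repo id matches this card, while dtype and
# device placement are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "W-61/llama-3-8b-base-new-dpo-hh-helpful-4xh200-batch-64-q_t-0.45-eta-0.1-s_star-0.45-20260428-045924"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # assumption; the card lists FP8 for serving
    device_map="auto",
)

prompt = "How do I write a polite follow-up email?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```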

Training Details

The model was fine-tuned using Direct Preference Optimization (DPO) on the Anthropic/hh-rlhf dataset. Key training hyperparameters included a learning rate of 5e-07, a total training batch size of 64, and a single epoch. Training used the AdamW optimizer and a cosine learning-rate scheduler with a warmup ratio of 0.1, and the run was distributed across four H200 GPUs.
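The sketch below shows how a comparable DPO run could be set up with the TRL library using the hyperparameters stated above. The per-device batch size and gradient accumulation are assumptions chosen to multiply out to the stated total of 64 across four GPUs, the output directory and bf16 setting are illustrative, and the exact trainer arguments vary by TRL version; treat this as a sketch, not the original training script.

```python
# Hedged sketch of a comparable DPO run with TRL; hyperparameters follow this
# card, everything else (batch split, dtype, output dir) is an assumption.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base_id = "W-61/llama-3-8b-base-sft-hh-helpful-4xh200"  # SFT starting point named on this card

model = AutoModelForCausalLM.from_pretrained(base_id)
tokenizer = AutoTokenizer.from_pretrained(base_id)

# Anthropic/hh-rlhf ships chosen/rejected pairs; recent TRL versions extract
# the shared prompt prefix automatically. Verify against your TRL version.
dataset = load_dataset("Anthropic/hh-rlhf", split="train")

config = DPOConfig(
    output_dir="llama-3-8b-dpo-hh-helpful",
    learning_rate=5e-7,             # stated on this card
    per_device_train_batch_size=4,  # assumption: 4 GPUs x 4 x 4 accumulation = 64 total
    gradient_accumulation_steps=4,  # assumption (see above)
    num_train_epochs=1,             # stated: single epoch
    lr_scheduler_type="cosine",     # stated: cosine scheduler
    warmup_ratio=0.1,               # stated: 0.1 warmup ratio
    bf16=True,                      # assumption
)

trainer = DPOTrainer(
    model=model,
    args=config,
    train_dataset=dataset,
    processing_class=tokenizer,  # named `tokenizer=` in older TRL releases
)
trainer.train()
```

Launched with `accelerate launch` or `torchrun` across four processes, this configuration reproduces the stated effective batch size of 64.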

Key Characteristics

  • Base Model: Llama 3 8B
  • Fine-tuning Method: Direct Preference Optimization (DPO)
  • Dataset: Anthropic/hh-rlhf, focusing on helpfulness alignment
  • Context Length: 8192 tokens

Intended Use Cases

This model is particularly suited for applications where generating helpful and aligned responses is critical. Its DPO fine-tuning on a human feedback dataset suggests improved performance in conversational agents, content generation, and question-answering systems that prioritize user assistance and beneficial output.
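As a concrete usage sketch: the hh-rlhf dataset formats conversations with "Human:" / "Assistant:" turn markers, so prompting this checkpoint in the same style is a reasonable assumption rather than a documented prompt format, and the question and sampling settings below are purely illustrative.

```python
# Illustrative assistant-style prompting; the Human:/Assistant: framing
# mirrors the hh-rlhf training data and is an assumption, not a documented
# prompt format for this checkpoint.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="W-61/llama-3-8b-base-new-dpo-hh-helpful-4xh200-batch-64-q_t-0.45-eta-0.1-s_star-0.45-20260428-045924",
    device_map="auto",
)

prompt = "\n\nHuman: Can you suggest three ways to make a small apartment feel larger?\n\nAssistant:"
result = generator(prompt, max_new_tokens=200, do_sample=True, temperature=0.7)
print(result[0]["generated_text"])
```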