W-61/llama-3-8b-base-new-dpo-hh-harmless-4xh200-batch-64-q_t-0.45-s_star-0.4-eta-1

Text Generation · Concurrency Cost: 1 · Model Size: 8B · Quant: FP8 · Ctx Length: 8k · Published: Apr 28, 2026 · Architecture: Transformer

W-61/llama-3-8b-base-new-dpo-hh-harmless-4xh200-batch-64-q_t-0.45-s_star-0.4-eta-1 is an 8 billion parameter language model, fine-tuned from W-61's Llama-3-8B-base-SFT variant. This model has undergone further optimization using Direct Preference Optimization (DPO) on the Anthropic/hh-rlhf dataset, specifically targeting harmlessness and helpfulness. It is designed for applications requiring robust and safe conversational AI, building upon the Llama 3 architecture.


Model Overview

This model, W-61/llama-3-8b-base-new-dpo-hh-harmless-4xh200-batch-64-q_t-0.45-s_star-0.4-eta-1, is an 8 billion parameter language model. It is a fine-tuned iteration of W-61/llama-3-8b-base-sft-hh-harmless-4xh200, which itself is based on the Llama 3 architecture.

Key Characteristics

  • Fine-tuning Method: Direct Preference Optimization (DPO), which aligns the model directly on preference pairs without training a separate reward model.
  • Dataset: Fine-tuned on the Anthropic/hh-rlhf dataset, which focuses on human feedback for helpfulness and harmlessness.
  • Base Model: Derived from a Llama 3 8B base model, inheriting its general language understanding and generation capabilities.
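As a rough illustration of the DPO objective named above (not the training code used for this model), the per-example loss can be sketched in plain Python; the log-probabilities and the `beta` value are hypothetical inputs:

```python
import math

def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Per-example DPO loss: -log sigmoid(beta * (chosen log-ratio - rejected log-ratio)),
    where each log-ratio compares the policy to the frozen reference model."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # Numerically stable -log(sigmoid(margin)) == softplus(-margin)
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# When the policy favors the chosen response more strongly than the
# reference does, the margin is positive and the loss drops below log(2).
loss = dpo_loss(-10.0, -14.0, -11.0, -13.0, beta=0.1)
```

Minimizing this loss pushes the policy to increase probability mass on chosen responses relative to rejected ones, with `beta` controlling how far it may drift from the reference model.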

Training Details

The model was trained with a learning rate of 5e-07, a total batch size of 64 (across 4 GPUs with gradient accumulation), and for 1 epoch. The optimizer was AdamW (its beta and epsilon values are not listed on this card), with a cosine learning rate scheduler and a 0.1 warmup ratio.
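The stated hyperparameters can be collected into a configuration sketch. Only values given on this card are included; the per-device batch size and gradient-accumulation split, and the AdamW betas/epsilon, are not reported, so they are omitted rather than guessed:

```python
# Training hyperparameters as stated on the model card.
training_config = {
    "learning_rate": 5e-7,
    "total_train_batch_size": 64,  # across 4 GPUs with gradient accumulation
    "num_gpus": 4,
    "num_train_epochs": 1,
    "optimizer": "adamw",          # beta/epsilon values not listed on the card
    "lr_scheduler_type": "cosine",
    "warmup_ratio": 0.1,
}
```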

Intended Use Cases

This model is particularly suited for applications where generating harmless and helpful responses is critical, such as:

  • Safe conversational agents
  • Content moderation assistance
  • Applications requiring adherence to ethical guidelines in AI interactions
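Because the Anthropic/hh-rlhf data uses plain "Human:"/"Assistant:" dialogue turns rather than a chat template, prompts for conversational use can be assembled as below. This is a sketch only; the exact prompt formatting used during training is not stated on this card:

```python
def build_hh_prompt(turns: list[tuple[str, str]], user_message: str) -> str:
    """Format prior (human, assistant) turns plus a new user message in the
    Anthropic hh-rlhf dialogue style, ending with an open Assistant turn."""
    parts = []
    for human, assistant in turns:
        parts.append(f"\n\nHuman: {human}\n\nAssistant: {assistant}")
    parts.append(f"\n\nHuman: {user_message}\n\nAssistant:")
    return "".join(parts)

prompt = build_hh_prompt([], "How do I politely decline a meeting?")
# The resulting string can then be tokenized and passed to the model's
# generate() call, stopping at the next "\n\nHuman:" marker.
```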