W-61/llama-3-8b-base-new-dpo-hh-harmless-4xh200-batch-64-q_t-0.45-s_star-0.4-eta-0.5

Text Generation · Concurrency Cost: 1 · Model Size: 8B · Quant: FP8 · Ctx Length: 8k · Published: Apr 28, 2026 · Architecture: Transformer

W-61/llama-3-8b-base-new-dpo-hh-harmless-4xh200-batch-64-q_t-0.45-s_star-0.4-eta-0.5 is an 8 billion parameter language model developed by W-61, fine-tuned from a Llama 3 base model. It was specifically optimized using Direct Preference Optimization (DPO) on the Anthropic/hh-rlhf dataset to enhance harmlessness. This model is intended for applications requiring a robust 8B parameter LLM with improved safety characteristics, operating with an 8192-token context length.


Overview

This model, developed by W-61, is an 8 billion parameter language model based on the Llama 3 architecture. It is a fine-tuned variant of W-61/llama-3-8b-base-sft-hh-harmless-4xh200, specifically optimized using Direct Preference Optimization (DPO).

Key Characteristics

  • Base Model: Llama 3 8B
  • Fine-tuning Method: Direct Preference Optimization (DPO)
  • Training Dataset: Anthropic/hh-rlhf, focusing on harmlessness
  • Context Length: 8192 tokens
  • Training Hyperparameters: learning rate of 5e-07, total batch size of 64, cosine learning rate scheduler, 1 epoch
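For intuition, the DPO objective behind this fine-tuning stage can be sketched for a single preference pair. This is a minimal illustration, not the training code used by W-61; in particular the `beta` value is an assumption, since the model card does not state it.

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Direct Preference Optimization loss for one preference pair.

    Arguments are summed log-probabilities of the chosen/rejected
    responses under the policy being trained (pi_*) and under the
    frozen reference model (ref_*). beta scales the implicit reward;
    0.1 is a common default, not a value stated on this model card.
    """
    # Implicit rewards: log-ratio of policy to reference model.
    reward_chosen = beta * (pi_chosen - ref_chosen)
    reward_rejected = beta * (pi_rejected - ref_rejected)
    # Bradley-Terry style loss: -log sigmoid(reward margin).
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# When the policy prefers the chosen response more strongly than
# the reference model does, the margin is positive and the loss
# drops below log(2).
loss = dpo_loss(pi_chosen=-10.0, pi_rejected=-14.0,
                ref_chosen=-12.0, ref_rejected=-12.0)
```

Minimizing this loss pushes the policy to raise the likelihood of the preferred (here, more harmless) response relative to the rejected one, anchored to the SFT reference model by `beta`.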

Intended Use Cases

This model is suitable for applications where a balance of performance and enhanced safety, particularly in generating harmless responses, is critical. Its DPO fine-tuning on the Anthropic/hh-rlhf dataset suggests an emphasis on reducing undesirable outputs, making it potentially useful for:

  • Content moderation assistance: Aiding in filtering or flagging potentially harmful content.
  • Safe conversational AI: Developing chatbots or virtual assistants designed to avoid generating harmful or biased text.
  • Research into DPO and safety alignment: Serving as a base for further experimentation and evaluation of alignment techniques.
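The Anthropic/hh-rlhf dataset used for the DPO stage stores each example as a pair of full dialogue transcripts (`chosen` and `rejected`) that share a common prefix ending at the final `Assistant:` turn. A sketch of splitting such a record into prompt/response fields for preference training follows; the splitting heuristic is illustrative and is not necessarily the exact preprocessing used by W-61.

```python
def split_pair(example):
    """Split an hh-rlhf style record into prompt and two responses.

    Assumes both transcripts share a prefix ending at the last
    "Assistant:" marker; that shared prefix is treated as the
    prompt. A real pipeline would need to handle edge cases such
    as the marker appearing inside a response.
    """
    chosen, rejected = example["chosen"], example["rejected"]
    # Cut after the final "Assistant:" turn, which both share.
    cut = chosen.rfind("Assistant:") + len("Assistant:")
    return {
        "prompt": chosen[:cut],
        "chosen": chosen[cut:].strip(),
        "rejected": rejected[cut:].strip(),
    }

pair = split_pair({
    "chosen": "Human: How do I stay safe online?\n\n"
              "Assistant: Use strong, unique passwords.",
    "rejected": "Human: How do I stay safe online?\n\n"
                "Assistant: Just reuse one password everywhere.",
})
```

Each resulting `(prompt, chosen, rejected)` triple is one preference pair of the kind the DPO objective is trained on.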