W-61/llama-3-8b-base-new-dpo-hh-harmless-4xh200-batch-64-s_star-0.4-eta-0.1-q_t-0.43

Text Generation · Concurrency Cost: 1 · Model Size: 8B · Quant: FP8 · Ctx Length: 8k · Published: Apr 28, 2026 · Architecture: Transformer · Cold

The W-61/llama-3-8b-base-new-dpo-hh-harmless-4xh200-batch-64-s_star-0.4-eta-0.1-q_t-0.43 model is an 8-billion-parameter language model fine-tuned from W-61/llama-3-8b-base-sft-hh-harmless-4xh200 using Direct Preference Optimization (DPO) on the Anthropic/hh-rlhf dataset, with the aim of improving harmlessness and alignment with human preferences. It is intended for applications that require a robust, safety-aligned conversational model with an 8192-token context length.


Model Overview

This model, llama-3-8b-base-new-dpo-hh-harmless-4xh200-batch-64-s_star-0.4-eta-0.1-q_t-0.43, is an 8-billion-parameter language model based on the Llama 3 architecture. It was fine-tuned from W-61/llama-3-8b-base-sft-hh-harmless-4xh200 using Direct Preference Optimization (DPO) on the Anthropic/hh-rlhf dataset, which focuses on helpfulness and harmlessness.

Key Characteristics

  • Parameter Count: 8 billion.
  • Context Length: Supports an 8192-token context window.
  • Fine-tuning Method: Utilizes Direct Preference Optimization (DPO) for alignment.
  • Training Data: Fine-tuned on the Anthropic/hh-rlhf dataset, emphasizing safety and helpfulness.
  • Training Hyperparameters: Key hyperparameters include a learning rate of 5e-07, a total training batch size of 64, and training for 1 epoch.
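The DPO objective used for this fine-tune can be sketched numerically: given the total log-probabilities of the chosen and rejected responses under both the policy and the frozen SFT reference model, the loss is −log σ(β·Δ), where Δ is the difference in implicit rewards. A minimal sketch (the β value below is illustrative; this card does not state the run's actual β):

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Direct Preference Optimization loss for one preference pair.

    Inputs are total log-probabilities of each full response; beta
    scales the implicit reward. (beta=0.1 is illustrative only.)
    """
    # Implicit rewards: beta * log-ratio of policy to reference.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)), computed stably as log1p(exp(-margin)).
    return math.log1p(math.exp(-margin))

# The loss shrinks as the policy prefers the chosen response more
# strongly than the reference does.
loss = dpo_loss(-10.0, -12.0, -11.0, -11.0)
```

When policy and reference agree exactly, the margin is zero and the loss equals log 2; training pushes the margin positive, driving the loss toward zero.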

Intended Use Cases

This model is suited to applications where generating harmless, well-aligned text is a priority. Its DPO fine-tuning on the Anthropic/hh-rlhf dataset makes it a candidate for:

  • Safety-critical applications: Where avoiding harmful or biased outputs is crucial.
  • Conversational AI: For chatbots and virtual assistants that need to maintain helpful and harmless interactions.
  • Content moderation: Assisting in filtering or generating content that adheres to safety guidelines.
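Because the fine-tuning data is Anthropic/hh-rlhf, prompts at inference time plausibly follow that dataset's dialogue format of alternating `\n\nHuman:` and `\n\nAssistant:` turns. The helper below is a hypothetical sketch of that formatting, not a documented template for this model; verify against the actual tokenizer configuration before relying on it:

```python
def format_hh_prompt(turns):
    """Format a conversation in the Anthropic hh-rlhf dialogue style
    ("\n\nHuman: ..." / "\n\nAssistant: ..." turns), ending with an
    open Assistant turn for the model to complete.

    This template is an assumption based on the training dataset;
    it is not confirmed by the model card.
    """
    parts = []
    for role, text in turns:
        tag = "Human" if role == "user" else "Assistant"
        parts.append(f"\n\n{tag}: {text}")
    # Leave an open Assistant turn for generation.
    parts.append("\n\nAssistant:")
    return "".join(parts)

prompt = format_hh_prompt([
    ("user", "How do I politely decline an invitation?"),
])
```

The resulting string would then be tokenized and passed to the model for generation, with the completion read from the text after the final `Assistant:` tag.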