W-61/llama-3-8b-base-new-dpo-hh-harmless-4xh200-batch-64-q_t-0.45-s_star-0.4-eta-0.01

Text Generation · Concurrency Cost: 1 · Model Size: 8B · Quant: FP8 · Ctx Length: 8K · Published: Apr 28, 2026 · Architecture: Transformer

W-61/llama-3-8b-base-new-dpo-hh-harmless-4xh200-batch-64-q_t-0.45-s_star-0.4-eta-0.01 is an 8-billion-parameter language model developed by W-61 and fine-tuned from a Llama-3-8B base model. This iteration applies Direct Preference Optimization (DPO) on the Anthropic/hh-rlhf dataset, with a focus on improving harmlessness. It is designed for applications that require a robust, safety-aligned language model with an 8K context length.


Model Overview

This model, developed by W-61, is an 8-billion-parameter language model built on the Llama-3-8B architecture. It is a fine-tuned version of the W-61/llama-3-8b-base-sft-hh-harmless-4xh200 model.
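As a minimal usage sketch, the checkpoint should load through the standard Hugging Face transformers causal-LM interface that Llama-3-8B derivatives share. The prompt format (hh-rlhf-style Human/Assistant turns) and generation settings below are assumptions, not published usage instructions:

```python
# Minimal inference sketch (assumes a standard Llama-3-style causal LM checkpoint;
# the prompt format and generation settings are illustrative, not from the card).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "W-61/llama-3-8b-base-new-dpo-hh-harmless-4xh200-batch-64-q_t-0.45-s_star-0.4-eta-0.01"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # FP8 serving is runtime-specific; bf16 is a safe local default
    device_map="auto",
)

# hh-rlhf-style prompts use "Human:" / "Assistant:" turns; this format is an assumption here.
prompt = "Human: How should I respond to an angry customer email?\n\nAssistant:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```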

Key Differentiator

The primary distinction of this model lies in its training methodology: it has undergone Direct Preference Optimization (DPO) on the Anthropic/hh-rlhf dataset. This fine-tuning stage aims to improve the model's harmlessness and alignment with human preferences, making it suitable for applications where safe, ethical responses are paramount.
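For context, DPO optimizes the policy directly against a frozen reference model on preference pairs, with no separate reward model. In its standard published form, with $y_w$ and $y_l$ the chosen and rejected responses, $\pi_{\mathrm{ref}}$ the reference model (here presumably the SFT checkpoint named above), and $\beta$ a strength coefficient not stated on this card, the objective is:

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\, \pi_{\mathrm{ref}}) =
  -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
  \left[
    \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right)
  \right]
```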

Training Details

Training used a learning rate of 5e-07, a total batch size of 64, and a cosine learning-rate scheduler with a 0.1 warmup ratio over 1 epoch. Optimization used AdamW (the beta and epsilon values are not listed here), and the run was distributed across 4 H200 GPUs.
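A hedged sketch of how this recipe could be reproduced with TRL's DPOTrainer follows. The per-device batch split (4 × 16 = 64), the β value, and the hh-rlhf prompt extraction are assumptions, and argument names vary across TRL versions (older releases take tokenizer= instead of processing_class=):

```python
# Hypothetical reproduction sketch using TRL's DPOTrainer with the card's published
# hyperparameters (lr 5e-07, total batch 64, cosine schedule, 0.1 warmup, 1 epoch).
# The beta value, batch split, and prompt extraction below are assumptions.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base = "W-61/llama-3-8b-base-sft-hh-harmless-4xh200"  # SFT checkpoint named above
model = AutoModelForCausalLM.from_pretrained(base)
tokenizer = AutoTokenizer.from_pretrained(base)

# hh-rlhf rows hold full "chosen"/"rejected" transcripts that share one prompt prefix;
# split on the last Assistant turn to get the prompt/chosen/rejected fields DPO needs.
def to_pairs(row):
    marker = "\n\nAssistant:"
    cut = row["chosen"].rfind(marker) + len(marker)
    return {
        "prompt": row["chosen"][:cut],
        "chosen": row["chosen"][cut:],
        "rejected": row["rejected"][cut:],
    }

dataset = load_dataset("Anthropic/hh-rlhf", split="train").map(to_pairs)

config = DPOConfig(
    output_dir="llama-3-8b-dpo-hh-harmless",
    learning_rate=5e-7,
    per_device_train_batch_size=16,  # 4 GPUs x 16 = total batch 64 (assumed split)
    num_train_epochs=1,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    optim="adamw_torch",
    beta=0.1,  # illustrative DPO beta; the card does not state this value
    bf16=True,
)

trainer = DPOTrainer(
    model=model,  # the reference model is created automatically when not passed
    args=config,
    train_dataset=dataset,
    processing_class=tokenizer,
)
trainer.train()
```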

Potential Use Cases

  • Safety-critical applications: Where generating harmless and ethically aligned content is a priority.
  • Content moderation: Assisting in filtering or generating safe text.
  • Dialogue systems: Creating chatbots or virtual assistants that prioritize non-toxic and helpful interactions.