W-61/llama-3-8b-base-new-dpo-hh-harmless-s_star1.0-4xh200-batch-64-20260422-051621

Text Generation · Concurrency Cost: 1 · Model Size: 8B · Quant: FP8 · Ctx Length: 8k · Published: Apr 22, 2026 · Architecture: Transformer · Cold

The W-61/llama-3-8b-base-new-dpo-hh-harmless-s_star1.0-4xh200-batch-64-20260422-051621 model is an 8-billion-parameter language model fine-tuned from W-61/llama-3-8b-base-sft-hh-harmless-4xh200. It was further optimized with Direct Preference Optimization (DPO) on the Anthropic/hh-rlhf dataset to improve harmlessness and alignment. The model targets applications that need a robust 8B-parameter base with improved safety characteristics and supports an 8192-token context window.


Model Overview

This model, llama-3-8b-base-new-dpo-hh-harmless-s_star1.0-4xh200-batch-64-20260422-051621, is an 8 billion parameter language model developed by W-61. It is a fine-tuned iteration of the W-61/llama-3-8b-base-sft-hh-harmless-4xh200 base model, specifically optimized using Direct Preference Optimization (DPO).
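
A minimal inference sketch is shown below. It assumes the checkpoint is published in standard Hugging Face Transformers format and borrows the hh-rlhf-style "Human:/Assistant:" prompt framing for illustration; the card does not prescribe a prompt template, and the prompt text and sampling settings are placeholders.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "W-61/llama-3-8b-base-new-dpo-hh-harmless-s_star1.0-4xh200-batch-64-20260422-051621"

# Assumes a standard Hugging Face checkpoint layout for this repository.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",   # load in the precision stored in the checkpoint config
    device_map="auto",    # requires `accelerate`; places the 8B model on available GPUs
)

# hh-rlhf-style prompt framing (illustrative only).
prompt = "Human: How should I dispose of old household batteries safely?\n\nAssistant:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```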

Key Characteristics

  • Base Model: Derived from a Llama 3 8B base variant.
  • Fine-tuning: Underwent DPO training on the Anthropic/hh-rlhf dataset, which is known for its focus on helpfulness and harmlessness.
  • Context Length: Supports an 8192 token context window.
  • Training Objective: The DPO process aligned the model's responses with human preferences, with particular emphasis on harmlessness (a sketch of the DPO objective follows this list).
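
For reference, the sketch below shows the standard DPO objective (Rafailov et al., 2023) that this kind of fine-tuning optimizes. It is a generic illustration rather than the repository's training code, and the `beta` default is a common placeholder, not the value used for this run.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Direct Preference Optimization loss.

    Each argument is a tensor of summed per-token log-probabilities for a batch of
    (prompt, response) pairs; `chosen` are the preferred and `rejected` the
    dispreferred responses from a preference dataset such as Anthropic/hh-rlhf.
    """
    # Implicit rewards: scaled log-prob ratios of the policy against the frozen reference.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)

    # Margin between chosen and rejected rewards; larger means better separation.
    margins = chosen_rewards - rejected_rewards

    # Negative log-sigmoid of the margin, averaged over the batch.
    loss = -F.logsigmoid(margins).mean()
    return loss, margins.mean()
```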

Training Details

The model was trained for 1 epoch with a learning rate of 5e-07 and a total batch size of 64 across 4 GPUs. Training logs report a final loss of 0.5433, with fcm_dpo/beta at 0.2164 and margin_dpo/margin_mean at 4.4651; the positive mean margin indicates that the fine-tuned policy assigns a noticeably higher implicit reward to preferred responses than to rejected ones.
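
A hypothetical reconstruction of such a run with TRL's `DPOTrainer` is sketched below. The actual training script is not published: the per-device batch size is inferred from the reported total (64 across 4 GPUs), the DPO `beta` used for this run is not stated on the card, and the exact `DPOTrainer` keyword arguments vary across TRL versions.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base_id = "W-61/llama-3-8b-base-sft-hh-harmless-4xh200"  # the SFT base named on this card
model = AutoModelForCausalLM.from_pretrained(base_id)
tokenizer = AutoTokenizer.from_pretrained(base_id)

# hh-rlhf stores full "Human:/Assistant:" transcripts in `chosen`/`rejected`;
# depending on the TRL version you may need to split out an explicit `prompt` field.
dataset = load_dataset("Anthropic/hh-rlhf", split="train")

config = DPOConfig(
    output_dir="llama-3-8b-dpo-hh-harmless",
    num_train_epochs=1,               # 1 epoch, as reported
    learning_rate=5e-7,               # reported learning rate
    per_device_train_batch_size=16,   # 16 x 4 GPUs = total batch size of 64 (inferred)
    # beta=...,                       # DPO temperature; the value for this run is not stated
)

trainer = DPOTrainer(
    model=model,
    args=config,
    train_dataset=dataset,
    processing_class=tokenizer,       # `tokenizer=` in older TRL releases
)
trainer.train()
```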

Potential Use Cases

This model is suitable for applications where a balance between capability and safety matters, especially scenarios that call for a robust 8B-parameter model with enhanced harmlessness. It is a reasonable choice for tasks that benefit from a preference-aligned language model.