W-61/llama-3-8b-base-beta-dpo-hh-helpful-8xh200

Text generation · Concurrency cost: 1 · Model size: 8B · Quantization: FP8 · Context length: 8k · Published: Apr 11, 2026 · Architecture: Transformer

W-61/llama-3-8b-base-beta-dpo-hh-helpful-8xh200 is an 8 billion parameter language model developed by W-61, fine-tuned from W-61/llama-3-8b-base-sft-hh-helpful-8xh200. It was further optimized with Direct Preference Optimization (DPO) on the Anthropic/hh-rlhf dataset to improve helpfulness and alignment. The model builds on the Llama 3 architecture and supports a context length of 8192 tokens.


Model Overview

W-61/llama-3-8b-base-beta-dpo-hh-helpful-8xh200 is the preference-optimization stage of W-61's Llama 3 8B pipeline: it starts from the supervised fine-tuned checkpoint W-61/llama-3-8b-base-sft-hh-helpful-8xh200 and applies Direct Preference Optimization (DPO) on top of it.

Key Characteristics

  • Base Model: Fine-tuned from a Llama 3 8B base model.
  • Fine-tuning: Utilizes Direct Preference Optimization (DPO) for alignment.
  • Dataset: Trained on the Anthropic/hh-rlhf dataset, indicating a focus on helpfulness and harmlessness.
  • Context Length: Supports a context window of 8192 tokens.
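Because the preference data comes from Anthropic/hh-rlhf, which stores conversations as alternating `\n\nHuman:` / `\n\nAssistant:` turns, prompts in that format are a natural starting point for this model. A minimal formatting sketch (the helper name and turn representation are illustrative, not part of the model card):

```python
def format_hh_prompt(turns):
    """Format (role, text) turns in the Anthropic hh-rlhf style.

    The dataset prefixes each turn with "\n\nHuman:" or "\n\nAssistant:",
    and a prompt ends with an open "\n\nAssistant:" for the model to complete.
    """
    parts = []
    for role, text in turns:
        tag = "Human" if role == "user" else "Assistant"
        parts.append(f"\n\n{tag}: {text}")
    parts.append("\n\nAssistant:")  # leave the final turn open for generation
    return "".join(parts)


prompt = format_hh_prompt([("user", "How do I boil an egg?")])
# prompt == "\n\nHuman: How do I boil an egg?\n\nAssistant:"
```

The resulting string can be tokenized and passed to any standard causal-LM generation loop; keep total length under the 8192-token context window.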

Training Details

The model was trained with a learning rate of 5e-07 for 1 epoch, using a total batch size of 128 across 8 GPUs. The final training loss was 0.6427, and the mean of the reported `beta_dpo/gap` metric was 20.0887, indicating that the policy learned to separate chosen from rejected responses during the DPO phase.
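The reported `beta_dpo/gap` metric is presumably the mean difference between the implicit DPO rewards of chosen and rejected responses; a widening gap is what drives the loss below its starting value of log 2 ≈ 0.693. A minimal sketch of the standard DPO objective for a single preference pair (function and argument names are illustrative, not W-61's training code):

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss: -log sigmoid(beta * (chosen - rejected) log-ratio gap).

    The log-probabilities are summed over the response tokens; `beta` scales
    the implicit reward (the value used for this model is not stated).
    """
    # Implicit rewards under the DPO parameterization
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    gap = chosen_reward - rejected_reward  # the quantity a "gap" metric would track
    # Numerically stable -log(sigmoid(gap))
    if gap > 0:
        loss = math.log1p(math.exp(-gap))
    else:
        loss = -gap + math.log1p(math.exp(gap))
    return loss, gap


# Before training, policy == reference, so the gap is 0 and the loss is log 2.
loss, gap = dpo_loss(-10.0, -12.0, -10.0, -12.0)
```

As the policy assigns relatively more probability to chosen responses than the reference does, the gap grows positive and the loss falls, matching the trajectory (loss 0.693 → 0.6427, positive mean gap) summarized above.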

Potential Use Cases

Given its DPO fine-tuning on a helpfulness dataset, this model is likely suitable for applications requiring:

  • Helpful and aligned responses: Generating user-friendly and constructive text.
  • General-purpose conversational AI: Where safety and helpfulness are priorities.
  • Further research into DPO and alignment techniques: As a base for experimental work.