W-61/llama-3-8b-base-new-dpo-hh-helpful-4xh200-batch-64-s_star-0.4-eta-0.1-q_t-0.43

Text Generation · Concurrency Cost: 1 · Model Size: 8B · Quant: FP8 · Ctx Length: 8k · Published: Apr 28, 2026 · Architecture: Transformer

W-61/llama-3-8b-base-new-dpo-hh-helpful-4xh200-batch-64-s_star-0.4-eta-0.1-q_t-0.43 is an 8 billion parameter language model fine-tuned by W-61. It is a DPO-tuned variant of the Llama 3 8B base model, optimized for helpfulness through training on the Anthropic/hh-rlhf dataset, and is designed to generate helpful, aligned responses.


Model Overview

This model, W-61/llama-3-8b-base-new-dpo-hh-helpful-4xh200-batch-64-s_star-0.4-eta-0.1-q_t-0.43, is an 8 billion parameter language model developed by W-61. It is a fine-tuned version of the W-61/llama-3-8b-base-sft-hh-helpful-4xh200 model, further optimized using Direct Preference Optimization (DPO).

Key Characteristics

  • Base Model: Built upon the Llama 3 8B architecture.
  • Fine-tuning: Underwent DPO training on the Anthropic/hh-rlhf dataset.
  • Optimization Goal: Primarily aimed at enhancing helpfulness and alignment in its responses.
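The DPO objective behind this fine-tuning can be sketched in a few lines. This is a minimal, dependency-free illustration of the standard per-example DPO loss, not the training code used for this model; the `beta` value is illustrative, since the card does not state it.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair:
    -log sigmoid(beta * (chosen log-ratio - rejected log-ratio)),
    where each log-ratio is policy log-prob minus reference log-prob."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_ratio - rejected_ratio)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

At initialization the policy equals the reference, both log-ratios are zero, and the loss is log 2; as the policy assigns relatively more probability to chosen responses, the margin grows and the loss falls toward zero.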

Training Details

The model was trained with the following hyperparameters:

  • Learning Rate: 5e-07
  • Batch Size: 8 per device (train and eval) with 2 gradient accumulation steps across 4 GPUs (the "4xh200" in the model name), for a total train batch size of 64.
  • Optimizer: AdamW with betas=(0.9, 0.999) and epsilon=1e-08.
  • Scheduler: Cosine learning rate scheduler with a 0.1 warmup ratio.
  • Epochs: Trained for 1 epoch.
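The hyperparameters above can be checked with a little arithmetic. The sketch below reconstructs the effective batch size and a warmup-then-cosine schedule matching the stated 0.1 warmup ratio; `total_steps` in the schedule function is a hypothetical input, as the card does not give the dataset size.

```python
import math

# Effective batch size: per-device batch x grad accumulation x GPU count.
per_device_batch = 8
grad_accum_steps = 2
num_gpus = 4  # from "4xh200" in the model name
effective_batch = per_device_batch * grad_accum_steps * num_gpus  # 64

def cosine_lr(step, total_steps, base_lr=5e-7, warmup_ratio=0.1):
    """Linear warmup for the first warmup_ratio of training,
    then cosine decay from base_lr down to 0."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

With 1000 total steps, the learning rate ramps to 5e-07 over the first 100 steps and decays to zero by the end of the single epoch.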

Intended Use

This model is suitable for applications requiring helpful and aligned text generation, leveraging its DPO fine-tuning on human preference data.
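For inference, prompts are typically formatted the way the Anthropic/hh-rlhf training data is laid out. The card does not state the exact template, so the sketch below assumes the dataset's alternating "Human:"/"Assistant:" turn format; the resulting string would then be tokenized and passed to the model's generate call.

```python
def build_hh_prompt(turns):
    """Format a conversation in the hh-rlhf style the model was
    preference-tuned on: alternating Human/Assistant turns, ending
    with an open Assistant turn for the model to complete."""
    parts = []
    for i, text in enumerate(turns):
        role = "Human" if i % 2 == 0 else "Assistant"
        parts.append(f"\n\n{role}: {text}")
    parts.append("\n\nAssistant:")
    return "".join(parts)

prompt = build_hh_prompt(["How do I sort a list in Python?"])
```

Since this is a DPO-tuned completion model rather than an instruction-tuned chat model with a built-in chat template, matching the fine-tuning prompt format like this generally matters for response quality.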