W-61/llama-3-8b-base-new-dpo-hh-helpful-4xh200-batch-64-q_t-0.45-s_star-0.4-eta-0.5

Text Generation · Concurrency Cost: 1 · Model Size: 8B · Quant: FP8 · Ctx Length: 8k · Published: Apr 28, 2026 · Architecture: Transformer · Cold

W-61/llama-3-8b-base-new-dpo-hh-helpful-4xh200-batch-64-q_t-0.45-s_star-0.4-eta-0.5 is an 8 billion parameter language model developed by W-61, fine-tuned from llama-3-8b-base-sft-hh-helpful-4xh200. This model has been optimized using Direct Preference Optimization (DPO) on the Anthropic/hh-rlhf dataset, enhancing its helpfulness and alignment. With an 8192-token context length, it is designed for applications requiring helpful and aligned text generation.


Model Overview

This model, developed by W-61, is an 8 billion parameter language model fine-tuned from the llama-3-8b-base-sft-hh-helpful-4xh200 base model. It leverages Direct Preference Optimization (DPO) on the Anthropic/hh-rlhf dataset to improve its helpfulness and alignment with human preferences. The training involved a single epoch with a learning rate of 5e-07 and a total batch size of 64, utilizing a cosine learning rate scheduler.

Key Characteristics

  • Base Model: Fine-tuned from llama-3-8b-base-sft-hh-helpful-4xh200.
  • Optimization Method: Utilizes Direct Preference Optimization (DPO).
  • Training Data: Optimized on the Anthropic/hh-rlhf dataset, focusing on helpfulness and harmlessness.
  • Parameter Count: 8 billion parameters.
  • Context Length: Supports an 8192-token context window.
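The DPO objective used for this fine-tune can be sketched in a few lines. The function below is a minimal, illustrative implementation of the per-pair DPO loss, not the model's actual training code; the function name and the `beta` value are assumptions (the card does not state beta).

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for a single preference pair.

    Inputs are summed log-probabilities of the chosen/rejected responses
    under the policy (pi_*) and the frozen reference model (ref_*).
    beta scales the implicit reward; 0.1 is a common default, assumed here.
    """
    # Implicit reward margin: how much more the policy favors the chosen
    # response over the rejected one, relative to the reference model.
    logits = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    # -log(sigmoid(logits)), written as log1p(exp(-x)) for numerical stability.
    return math.log1p(math.exp(-logits))
```

Minimizing this loss pushes the policy's log-probability of the chosen response up and the rejected response down, relative to the reference model, which is how the hh-rlhf preference pairs shape the model's helpfulness.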

Training Details

The model was trained with a learning rate of 5e-07, a per-device train_batch_size of 8, and gradient_accumulation_steps of 2; across 4 devices this yields the effective total_train_batch_size of 64 (8 × 2 × 4, consistent with the 4xH200 setup in the model name). The optimizer was AdamW with default betas and epsilon, and a cosine learning rate scheduler with a warmup ratio of 0.1 was applied over 1 epoch.
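As an illustration of the schedule described above, here is a minimal sketch of a cosine learning-rate schedule with linear warmup using the card's stated values. The function name and the exact warmup/decay formulas are assumptions; real implementations (e.g. Hugging Face's `get_cosine_schedule_with_warmup`) differ in small details.

```python
import math

def lr_at_step(step, total_steps, base_lr=5e-07, warmup_ratio=0.1):
    """Cosine LR schedule with linear warmup (values from the model card).

    Linearly ramps from 0 to base_lr over the first warmup_ratio of
    training, then decays to 0 along a half-cosine curve.
    """
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear warmup phase.
        return base_lr * step / max(1, warmup_steps)
    # Cosine decay phase: progress goes 0 -> 1 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```

At the end of warmup the rate peaks at the base learning rate of 5e-07 and then decays smoothly toward zero by the end of the single training epoch.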

Intended Use Cases

This model is suitable for applications where generating helpful, aligned, and preference-optimized text is crucial, particularly in conversational AI or assistant-like roles where human feedback has been incorporated into its training.