W-61/llama-3-8b-base-epsilon-dpo-hh-helpful-4xh200-batch-64-20260418-001920

Text Generation · Concurrency Cost: 1 · Model Size: 8B · Quant: FP8 · Ctx Length: 8k · Published: Apr 18, 2026 · Architecture: Transformer · Cold

W-61/llama-3-8b-base-epsilon-dpo-hh-helpful-4xh200-batch-64-20260418-001920 is an 8-billion-parameter language model fine-tuned from llama-3-8b-base-sft-hh-helpful-4xh200-batch-64 using Direct Preference Optimization (DPO) on the Anthropic/hh-rlhf dataset. The model is tuned for helpfulness and alignment, reaching a rewards accuracy of 0.7183 on the evaluation set, and supports a context length of 8192 tokens. It is intended for applications that require robust, aligned conversational AI.


Model Overview

This model, llama-3-8b-base-epsilon-dpo-hh-helpful-4xh200-batch-64-20260418-001920, is an 8 billion parameter language model developed by W-61. It is a fine-tuned variant of the llama-3-8b-base-sft-hh-helpful-4xh200-batch-64 base model, specifically optimized using Direct Preference Optimization (DPO).
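
As a rough illustration of how a DPO fine-tune of this kind is typically produced, the sketch below uses the trl library's DPOTrainer on Anthropic/hh-rlhf preference pairs, with hyperparameters matching the Key Characteristics listed further down. The SFT checkpoint path, the prompt-splitting preprocessing, and the exact DPOConfig/DPOTrainer argument names are assumptions (they vary across trl versions); this is not the training code used for this release.

```python
# Sketch of a DPO fine-tune with Hugging Face TRL; not the exact recipe used
# for this release, and argument names vary across trl versions.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base = "path/to/llama-3-8b-base-sft-hh-helpful"   # hypothetical SFT checkpoint path
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

def to_preference_format(example):
    # hh-rlhf stores full dialogues; split the shared prefix (prompt) from the
    # final assistant turns so DPOTrainer sees prompt/chosen/rejected columns.
    marker = "\n\nAssistant:"
    cut = example["chosen"].rfind(marker) + len(marker)
    return {
        "prompt": example["chosen"][:cut],
        "chosen": example["chosen"][cut:],
        "rejected": example["rejected"][example["rejected"].rfind(marker) + len(marker):],
    }

dataset = load_dataset("Anthropic/hh-rlhf", split="train").map(to_preference_format)

config = DPOConfig(
    output_dir="llama-3-8b-dpo-hh-helpful",
    num_train_epochs=1,               # single epoch, per the card
    learning_rate=5e-7,
    per_device_train_batch_size=16,   # 4 GPUs x 16 = total batch size 64
    lr_scheduler_type="cosine",
    optim="adamw_torch",
)

trainer = DPOTrainer(
    model=model,
    args=config,
    train_dataset=dataset,
    processing_class=tokenizer,       # older trl versions take `tokenizer=` instead
)
trainer.train()
```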

Key Characteristics

  • Fine-tuning Objective: The model was fine-tuned on the Anthropic/hh-rlhf dataset, focusing on improving helpfulness and alignment through DPO.
  • Performance Metrics: On the evaluation set, it achieved a rewards accuracy of 0.7183, with chosen rewards averaging -0.9073 and rejected rewards averaging -1.1791; the DPO loss was 0.5941 (the sketch after this list shows how these quantities are conventionally computed).
  • Context Length: Supports an 8192-token context window, enabling processing of longer inputs and generating more coherent, extended responses.
  • Training Details: Training involved a single epoch with a learning rate of 5e-07, a total batch size of 64, and utilized the AdamW optimizer with a cosine learning rate scheduler.
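
For readers unfamiliar with the reported quantities, the following is a generic sketch of the standard DPO formulation, showing how the loss, chosen/rejected rewards, and rewards accuracy are conventionally computed from policy and reference log-probabilities. The beta value is a common default rather than a documented setting of this model, and this is not the evaluation code used here.

```python
import torch
import torch.nn.functional as F

def dpo_metrics(policy_chosen_logp, policy_rejected_logp,
                ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss and reward metrics for a batch of preference pairs.

    Each argument is a tensor of summed token log-probabilities, one value per
    (prompt, response) pair. beta=0.1 is a common default, not necessarily the
    value used to train this model.
    """
    # Implicit rewards: how far the policy has moved from the reference model.
    chosen_rewards = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_rewards = beta * (policy_rejected_logp - ref_rejected_logp)

    # DPO loss: -log sigmoid of the reward margin between chosen and rejected.
    loss = -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

    # Rewards accuracy: fraction of pairs where the chosen response wins.
    accuracy = (chosen_rewards > rejected_rewards).float().mean()

    return {
        "loss": loss.item(),
        "rewards/chosen": chosen_rewards.mean().item(),
        "rewards/rejected": rejected_rewards.mean().item(),
        "rewards/accuracy": accuracy.item(),
    }

if __name__ == "__main__":
    # Illustrative call on random summed log-probabilities for a batch of 64 pairs.
    logps = [-torch.rand(64) * 100 for _ in range(4)]
    print(dpo_metrics(*logps))
```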

Intended Use Cases

This model is suitable for applications where generating helpful, aligned, and contextually relevant text is crucial. Its DPO fine-tuning on a human feedback dataset suggests strong performance in the areas below (a minimal loading-and-generation sketch follows the list):

  • Conversational AI: Developing chatbots or virtual assistants that provide helpful and safe responses.
  • Content Generation: Creating aligned and informative text across various domains.
  • Instruction Following: Responding accurately and helpfully to user prompts and instructions.
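
For completeness, here is a minimal inference sketch with the Hugging Face transformers library. The repository identifier is assumed from the model name on this page, and the Human/Assistant prompt framing mirrors the hh-rlhf format, since the card does not document a required chat template.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed repository identifier; substitute the actual hosted repo or local path.
model_id = "W-61/llama-3-8b-base-epsilon-dpo-hh-helpful-4xh200-batch-64-20260418-001920"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# hh-rlhf-style prompt framing; the card does not document a required template.
prompt = "\n\nHuman: How do I politely decline a meeting invitation?\n\nAssistant:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
# Decode only the newly generated tokens, dropping the echoed prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```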