jackf857/qwen3-8b-base-epsilon-dpo-hh-helpful-4xh200-batch-64-20260424-040306

Text Generation · Concurrency Cost: 1 · Model Size: 8B · Quant: FP8 · Ctx Length: 32k · Published: Apr 24, 2026 · Architecture: Transformer

jackf857/qwen3-8b-base-epsilon-dpo-hh-helpful-4xh200-batch-64-20260424-040306 is an 8-billion-parameter language model fine-tuned from jackf857/qwen3-8b-base-sft-hh-helpful-4xh200-batch-64-20260417-214452. It was aligned using Direct Preference Optimization (DPO) on the Anthropic/hh-rlhf dataset, reaching a rewards accuracy of 0.7568. Building on its SFT base, the model is designed to generate helpful, preference-aligned responses.


Model Overview

This model, jackf857/qwen3-8b-base-epsilon-dpo-hh-helpful-4xh200-batch-64-20260424-040306, is an 8-billion-parameter language model. It is a fine-tuned variant of jackf857/qwen3-8b-base-sft-hh-helpful-4xh200-batch-64-20260417-214452, aligned using Direct Preference Optimization (DPO).

Key Differentiators

  • Fine-tuned for Helpfulness: The model underwent DPO training on the Anthropic/hh-rlhf dataset, which is designed to align models with human preferences for helpfulness and harmlessness.
  • Performance Metrics: During evaluation it reached a rewards accuracy of 0.7568, with a chosen reward of -0.6029 and a rejected reward of -0.8720; in other words, the model's implicit reward ranked the preferred response above the rejected one in roughly 76% of evaluation pairs.
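In the standard DPO formulation, "rewards accuracy" is simply the fraction of preference pairs where the implicit reward of the chosen response exceeds that of the rejected response. A minimal sketch (the per-pair reward values below are hypothetical illustrations, not this model's actual evaluation data):

```python
def rewards_accuracy(chosen_rewards, rejected_rewards):
    """Fraction of pairs where the chosen response's implicit
    reward beats the rejected response's implicit reward."""
    pairs = list(zip(chosen_rewards, rejected_rewards))
    wins = sum(1 for c, r in pairs if c > r)
    return wins / len(pairs)

# Hypothetical per-pair rewards for illustration only.
chosen = [-0.4, -0.7, -0.5, -0.9]
rejected = [-0.9, -0.6, -1.1, -1.2]
print(rewards_accuracy(chosen, rejected))  # 3 of 4 pairs -> 0.75
```

Note that both chosen and rejected rewards can be negative (as they are for this model); only their relative ordering matters for the accuracy metric.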

Training Details

  • Optimization Method: Utilizes Direct Preference Optimization (DPO) for alignment.
  • Hyperparameters: Trained with a learning rate of 5e-07, a total batch size of 64, and a cosine learning rate scheduler over 1 epoch.

Intended Use Cases

This model is suitable for applications requiring helpful and aligned text generation, particularly in scenarios where human preference alignment is crucial. Its DPO training on the hh-rlhf dataset suggests a strong capability in generating responses that are perceived as more helpful and less harmful by humans.