jackf857/qwen3-8b-base-new-dpo-hh-helpful-4xh200-batch-64-q_t-0.45-s_star-0.85

Text Generation · Concurrency Cost: 1 · Model Size: 8B · Quantization: FP8 · Context Length: 32k · Published: Apr 26, 2026 · License: apache-2.0 · Architecture: Transformer · Open Weights

This is an 8 billion parameter Qwen3-based language model, fine-tuned by jackf857 using Direct Preference Optimization (DPO) on the Anthropic/hh-rlhf dataset. It is optimized for helpfulness and safety, building upon a supervised fine-tuned base model. The model has a context length of 32768 tokens and is designed for general conversational AI applications where helpful and harmless responses are prioritized.
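
A minimal generation sketch, assuming the checkpoint is available on the Hugging Face Hub under the repo id above and that it responds to the hh-rlhf "Human:/Assistant:" turn format (the prompt and sampling settings below are illustrative, not values from this card):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "jackf857/qwen3-8b-base-new-dpo-hh-helpful-4xh200-batch-64-q_t-0.45-s_star-0.85"
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo, torch_dtype="auto", device_map="auto")

# hh-rlhf conversations use "\n\nHuman: ... \n\nAssistant:" turns; whether this
# checkpoint expects that format or a chat template is an assumption here.
prompt = "\n\nHuman: How do I brew a good cup of green tea?\n\nAssistant:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```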

Model Overview

This model, qwen3-8b-base-new-dpo-hh-helpful-4xh200-batch-64-q_t-0.45-s_star-0.85, is an 8 billion parameter language model based on the Qwen3 architecture. It was developed by jackf857 as a DPO fine-tune of an earlier supervised fine-tuned (SFT) checkpoint, jackf857/qwen3-8b-base-sft-hh-helpful-4xh200-batch-64-20260417-214452.

Key Capabilities

  • Direct Preference Optimization (DPO): The model was trained with DPO, a method that aligns language models with human preferences by optimizing directly on pairs of preferred and rejected responses, without a separate reward model.
  • Anthropic/hh-rlhf Dataset: Fine-tuned on the Anthropic/hh-rlhf dataset, a collection of human preference pairs designed to make model responses both helpful and harmless.
  • Performance Metrics: Achieved a final loss of 0.4843 on the evaluation set, with DPO reward margins (chosen rewards exceeding rejected rewards) indicating successful preference learning; the sketch after this list shows how these quantities are computed.
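
For reference, a minimal sketch of the DPO objective and the reward-margin metric mentioned above. The log-probabilities are per-sequence sums under the policy and the frozen reference (SFT) model; `beta=0.1` is an assumed value, not one reported in this card:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1):  # beta is an assumption; not reported in the card
    # Implicit rewards: scaled log-ratios of policy vs. reference model.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # DPO loss: push the chosen reward above the rejected reward.
    loss = -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
    # The "margin" metric reported during DPO training.
    margin = (chosen_rewards - rejected_rewards).mean()
    return loss, margin
```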

Training Details

The model was trained for 1 epoch with a learning rate of 5e-07 and a total batch size of 64 across 4 GPUs, using the AdamW optimizer with a cosine learning rate schedule and a warmup ratio of 0.1. Training ran for 600 steps, with validation loss and DPO reward metrics improving consistently throughout. A sketch of how these hyperparameters might map onto a DPO training setup follows.
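
The card does not name the training framework; below is a minimal sketch assuming trl's DPOTrainer, starting from the SFT checkpoint above. The per-device/accumulation batch split and the prompt-extraction helper are assumptions, not details from the card:

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

SFT_REPO = "jackf857/qwen3-8b-base-sft-hh-helpful-4xh200-batch-64-20260417-214452"
model = AutoModelForCausalLM.from_pretrained(SFT_REPO)
tokenizer = AutoTokenizer.from_pretrained(SFT_REPO)

def split_prompt(example):
    # hh-rlhf stores full conversations; DPOTrainer wants explicit
    # prompt/chosen/rejected columns, so split at the final assistant turn.
    marker = "\n\nAssistant:"
    cut = example["chosen"].rfind(marker) + len(marker)
    prompt = example["chosen"][:cut]
    return {"prompt": prompt,
            "chosen": example["chosen"][cut:],
            "rejected": example["rejected"][len(prompt):]}

train_ds = load_dataset("Anthropic/hh-rlhf", split="train").map(split_prompt)

# Hyperparameters from the card: lr 5e-7, cosine schedule, warmup ratio 0.1,
# 1 epoch, effective batch size 64 on 4 GPUs. The split below
# (4 GPUs x 8 per device x 2 accumulation = 64) is an assumption.
args = DPOConfig(
    output_dir="qwen3-8b-dpo-hh-helpful",
    num_train_epochs=1,
    learning_rate=5e-7,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,
)

trainer = DPOTrainer(model=model, args=args,
                     train_dataset=train_ds,
                     processing_class=tokenizer)  # tokenizer= in older trl versions
trainer.train()
```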