jackf857/qwen3-8b-base-new-dpo-hh-helpful-4xh200-batch-64-q_t-0.45-s_star-0.85
This is an 8 billion parameter Qwen3-based language model, fine-tuned by jackf857 using Direct Preference Optimization (DPO) on the Anthropic/hh-rlhf dataset. It is optimized for helpfulness and safety, building upon a supervised fine-tuned base model. The model has a context length of 32768 tokens and is designed for general conversational AI applications where helpful and harmless responses are prioritized.
Model Overview
This model, qwen3-8b-base-new-dpo-hh-helpful-4xh200-batch-64-q_t-0.45-s_star-0.85, is an 8 billion parameter language model based on the Qwen3 architecture. It was developed by jackf857 and represents a further fine-tuning of a previously supervised fine-tuned (SFT) model, specifically jackf857/qwen3-8b-base-sft-hh-helpful-4xh200-batch-64-20260417-214452.
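A minimal usage sketch with the Hugging Face `transformers` library. This assumes the checkpoint follows the standard Qwen3 causal-LM layout; the prompt and generation settings are illustrative, not recommendations from the model author:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "jackf857/qwen3-8b-base-new-dpo-hh-helpful-4xh200-batch-64-q_t-0.45-s_star-0.85"

# Load the tokenizer and weights (the 8B checkpoint is large; "auto" picks
# the stored dtype and places layers on available devices)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

prompt = "How do I politely decline a meeting invitation?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Sample a response; max_new_tokens and temperature are illustrative choices
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```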
Key Capabilities
- Direct Preference Optimization (DPO): The model has undergone DPO training, a method that aligns language models with human preferences directly from preference pairs, without training a separate reward model, and is widely used to improve helpfulness and harmlessness.
- Anthropic/hh-rlhf Dataset: Fine-tuned on the Anthropic/hh-rlhf dataset, a collection of human preference pairs designed to steer models toward responses that are both helpful and harmless.
- Performance Metrics: Achieved a final loss of 0.4843 on the evaluation set, with DPO reward margins (chosen minus rejected rewards) indicating successful preference learning.
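The DPO objective behind the metrics above can be sketched in a few lines. This is a generic, single-pair version of the standard DPO loss, not code from this repository; `beta` is the usual temperature hyperparameter, and the inputs are summed log-probabilities of each response under the policy and the frozen reference (SFT) model:

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair.

    Each argument is the summed log-probability of the chosen/rejected
    response under the policy (pi_*) or the reference model (ref_*).
    """
    # Implicit reward margin: how much more the policy prefers "chosen"
    # over "rejected", relative to the reference model
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    # -log sigmoid(margin): minimized by pushing the margin positive
    loss = -math.log(1.0 / (1.0 + math.exp(-margin)))
    return loss, margin

# When the policy matches the reference exactly, the margin is 0
# and the loss equals log(2) ≈ 0.693
loss, margin = dpo_loss(-10.0, -12.0, -10.0, -12.0)
```

A positive reward margin on the evaluation set is exactly the "DPO margin metric" a model card like this one reports: it means the trained policy assigns relatively more probability to preferred responses than the reference model does.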
Training Details
The model was trained for 1 epoch with a learning rate of 5e-07 and a total batch size of 64 across 4 GPUs, using the AdamW optimizer with a cosine learning rate scheduler and a warmup ratio of 0.1. Training ran for 600 steps, with validation loss and DPO reward metrics improving consistently throughout.
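The hyperparameters above map naturally onto a TRL `DPOConfig`. The following is a hedged sketch, not the author's actual training script: the per-device batch size and gradient-accumulation split of the total batch size of 64, and the DPO `beta`, are assumptions not stated in the card.

```python
from trl import DPOConfig

# Reconstructed from the reported hyperparameters; the 64-sample global
# batch is assumed to be 16 per device x 4 GPUs with no accumulation
config = DPOConfig(
    output_dir="qwen3-8b-dpo-hh-helpful",
    num_train_epochs=1,
    learning_rate=5e-7,
    per_device_train_batch_size=16,   # assumption: 16 x 4 GPUs = 64
    gradient_accumulation_steps=1,    # assumption
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    beta=0.1,                         # assumption: TRL's default DPO beta
)
```

This config would then be passed to TRL's `DPOTrainer` together with the SFT checkpoint (jackf857/qwen3-8b-base-sft-hh-helpful-4xh200-batch-64-20260417-214452) as the starting policy and reference model.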