jackf857/qwen3-8b-base-epsilon-dpo-hh-helpful-4xh200-batch-64-20260424-040306
jackf857/qwen3-8b-base-epsilon-dpo-hh-helpful-4xh200-batch-64-20260424-040306 is an 8-billion-parameter language model fine-tuned from jackf857/qwen3-8b-base-sft-hh-helpful-4xh200-batch-64-20260417-214452. It was optimized with Direct Preference Optimization (DPO) on the Anthropic/hh-rlhf dataset, reaching a rewards accuracy of 0.7568. Building on its SFT base, the model is intended for generating helpful, preference-aligned responses.
Model Overview
This model is an 8-billion-parameter language model, fine-tuned from jackf857/qwen3-8b-base-sft-hh-helpful-4xh200-batch-64-20260417-214452 using Direct Preference Optimization (DPO).
Key Differentiators
- Fine-tuned for Helpfulness: The model underwent DPO training on the Anthropic/hh-rlhf dataset, a collection of human preference comparisons between paired responses, used to align models toward helpfulness and harmlessness.
- Performance Metrics: During evaluation, it achieved a rewards accuracy of 0.7568, with a chosen reward of -0.6029 and a rejected reward of -0.8720; in other words, the model assigns a higher implicit reward to the preferred response in roughly 76% of preference pairs (see the sketch after this list).
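For reference, DPO's rewards accuracy is the fraction of preference pairs for which the policy's implicit reward for the chosen response exceeds that for the rejected one, where the implicit reward is beta times the log-probability ratio between the policy and the frozen reference model. Below is a minimal sketch of that computation, assuming per-sequence log-probabilities are already gathered; the function names and the beta=0.1 default are illustrative, not taken from this training run.

```python
import torch

def implicit_reward(policy_logps: torch.Tensor, ref_logps: torch.Tensor, beta: float = 0.1) -> torch.Tensor:
    # DPO implicit reward: beta * (log pi_theta(y|x) - log pi_ref(y|x)).
    # beta=0.1 is a common default, not confirmed for this run.
    return beta * (policy_logps - ref_logps)

def rewards_accuracy(policy_chosen, ref_chosen, policy_rejected, ref_rejected, beta: float = 0.1) -> float:
    # Fraction of pairs where the chosen response earns the higher implicit reward.
    chosen = implicit_reward(policy_chosen, ref_chosen, beta)
    rejected = implicit_reward(policy_rejected, ref_rejected, beta)
    return (chosen > rejected).float().mean().item()
```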
Training Details
- Optimization Method: Utilizes Direct Preference Optimization (DPO) for alignment.
- Hyperparameters: Trained with a learning rate of 5e-07, a total batch size of 64, and a cosine learning-rate scheduler over 1 epoch (a configuration sketch follows this list).
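These hyperparameters map directly onto TRL's DPOTrainer. The snippet below is a minimal sketch of what such a run could look like, not the actual training script: the per-device batch split across the 4 H200 GPUs is an assumption (4 GPUs x 4 per device x 4 accumulation steps = 64), and the argument names follow recent TRL releases (older versions take tokenizer= instead of processing_class=).

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base = "jackf857/qwen3-8b-base-sft-hh-helpful-4xh200-batch-64-20260417-214452"
model = AutoModelForCausalLM.from_pretrained(base)
tokenizer = AutoTokenizer.from_pretrained(base)

# Hyperparameters from this model card; the 4-GPU batch split is illustrative.
config = DPOConfig(
    output_dir="qwen3-8b-epsilon-dpo-hh-helpful",
    learning_rate=5e-7,
    per_device_train_batch_size=4,   # 4 GPUs x 4 x 4 grad-accum steps = 64 total
    gradient_accumulation_steps=4,
    num_train_epochs=1,
    lr_scheduler_type="cosine",
)

# hh-rlhf stores full dialogues in "chosen"/"rejected" columns; recent TRL extracts
# the shared prompt automatically, older versions need explicit prompt/chosen/rejected.
train_dataset = load_dataset("Anthropic/hh-rlhf", split="train")

trainer = DPOTrainer(
    model=model,                 # a frozen reference copy is created internally
    args=config,
    train_dataset=train_dataset,
    processing_class=tokenizer,
)
trainer.train()
```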
Intended Use Cases
This model is suited to applications that require helpful, preference-aligned text generation, particularly where agreement with human judgments matters. Because it was preference-tuned on hh-rlhf, its responses should be more likely to be rated helpful and harmless by human evaluators than those of its SFT base; a minimal inference sketch follows.
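The following is a minimal inference sketch using Transformers. The hh-rlhf-style prompt format ("\n\nHuman: ...\n\nAssistant:") and the sampling settings are assumptions, not documented for this model; check the tokenizer for a chat template before relying on them, and note that device_map="auto" requires the accelerate package.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "jackf857/qwen3-8b-base-epsilon-dpo-hh-helpful-4xh200-batch-64-20260424-040306"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype="auto")

# hh-rlhf-style dialogue prompt; an assumption, not confirmed by the model card.
prompt = "\n\nHuman: How do I write a polite follow-up email?\n\nAssistant:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7)

# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```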