jackf857/qwen3-8b-base-new-dpo-hh-helpful-4xh200-batch-64-q_t-0.45-s_star-0.6
The jackf857/qwen3-8b-base-new-dpo-hh-helpful-4xh200-batch-64-q_t-0.45-s_star-0.6 model is an 8-billion-parameter language model fine-tuned by jackf857. It is based on a Qwen3-8B-Base variant and was trained with Direct Preference Optimization (DPO) on the Anthropic/hh-rlhf dataset. The fine-tuning targets helpfulness and alignment with human preferences, with conversational AI and instruction-following tasks in mind. The model reached a validation loss of 0.5224 after one epoch of training.
Model Overview
This model, jackf857/qwen3-8b-base-new-dpo-hh-helpful-4xh200-batch-64-q_t-0.45-s_star-0.6, is an 8-billion-parameter language model developed by jackf857. It is a fine-tuned version of jackf857/qwen3-8b-base-sft-hh-helpful-4xh200-batch-64-20260417-214452.
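For readers who want to try the model, the following is a minimal loading sketch using the Hugging Face transformers library. It is not taken from the model card. The prompt layout in build_prompt is an assumption based on the "Human:/Assistant:" dialogue format of the Anthropic/hh-rlhf data; the card does not say which prompt format was used in training. The helper names build_prompt and generate are hypothetical.

```python
MODEL_ID = (
    "jackf857/qwen3-8b-base-new-dpo-hh-helpful-4xh200-"
    "batch-64-q_t-0.45-s_star-0.6"
)


def build_prompt(user_message: str) -> str:
    # Assumption: an HH-style single-turn dialogue prompt.
    # The model card does not document the expected prompt format.
    return f"\n\nHuman: {user_message}\n\nAssistant:"


def generate(user_message: str, max_new_tokens: int = 128) -> str:
    # Imported lazily so the helpers above can be used without
    # transformers installed. device_map="auto" also needs accelerate.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype="auto", device_map="auto"
    )
    inputs = tokenizer(build_prompt(user_message), return_tensors="pt")
    inputs = inputs.to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Keep only the newly generated tokens, not the echoed prompt.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)


# Example call (downloads the full 8B checkpoint on first use):
# print(generate("How do I brew a good cup of tea?"))
```

If the tokenizer ships a chat template, tokenizer.apply_chat_template may be a better fit than the hand-built prompt; that would need to be checked against the repository files.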
Fine-tuning Details
The model underwent fine-tuning using Direct Preference Optimization (DPO) on the Anthropic/hh-rlhf dataset. This process is designed to align the model's outputs with human preferences, particularly for helpfulness.
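To make the preference objective concrete, here is a minimal sketch of the standard per-pair DPO loss in plain Python. It illustrates the general technique only: the model name and logged metrics hint at a modified variant (the q_t and s_star settings), which the card does not describe, so this should not be read as the exact loss used in this run. The function names and the example numbers in the comments are illustrative assumptions.

```python
import math


def log_sigmoid(x: float) -> float:
    # Numerically stable log(sigmoid(x)) for large positive or negative x.
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float,
) -> float:
    # Log-ratio of policy to reference for each response.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    # Scaled preference margin: positive when the policy favours the
    # chosen response more strongly than the reference model does.
    margin = beta * (chosen_logratio - rejected_logratio)
    return -log_sigmoid(margin)


# Illustrative values only (not taken from this training run):
# with identical policy and reference log-probabilities the margin is
# zero and the loss equals log(2).
baseline = dpo_loss(-10.0, -10.0, -10.0, -10.0, beta=0.1)
improved = dpo_loss(-5.0, -15.0, -10.0, -10.0, beta=0.1)
```

The loss falls below log(2) once the policy separates chosen from rejected responses more than the reference does, which is the direction DPO training pushes.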
Training Hyperparameters
Key hyperparameters used during training include:
- Learning Rate: 5e-07
- Optimizer: AdamW_Torch with betas=(0.9, 0.999)
- Batch Size: 8 for both training and evaluation; the total effective training batch size is 64, reached across the 4 devices with gradient accumulation
- Epochs: 1
- Distributed Training: Multi-GPU setup with 4 devices
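The batch-size figures above can be cross-checked with a little arithmetic. The sketch below assumes that the listed batch size of 8 is per device, which is how Hugging Face Trainer-style cards usually report it; the card itself does not state the number of gradient accumulation steps, so the value computed here is implied rather than documented. The function name is hypothetical.

```python
def implied_accumulation_steps(
    per_device_batch: int, num_devices: int, effective_batch: int
) -> int:
    # Samples processed in one forward/backward pass across all devices.
    samples_per_step = per_device_batch * num_devices
    if effective_batch % samples_per_step != 0:
        raise ValueError(
            "effective batch is not a multiple of per-step samples"
        )
    return effective_batch // samples_per_step


# Values from the hyperparameter list; per-device interpretation assumed.
steps = implied_accumulation_steps(
    per_device_batch=8, num_devices=4, effective_batch=64
)
print(steps)  # 2
```

Under that assumption, 8 samples on each of 4 devices gives 32 samples per step, and 2 accumulation steps give the reported effective batch of 64.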
Performance Metrics
Upon evaluation, the model achieved a final validation loss of 0.5224. Other notable metrics from the DPO training include:
- Fcm Dpo/beta: 0.0073
- Margin Dpo/margin Mean: 73.6023
- Logps/chosen: -265.1147
- Logps/rejected: -332.2911
This fine-tuning aims to improve the model's ability to generate responses that are preferred by humans, making it potentially more effective in interactive and helpful AI applications.