jackf857/qwen3-8b-base-beta-dpo-hh-helpful-4xh200-batch-64-20260424-013732
jackf857/qwen3-8b-base-beta-dpo-hh-helpful-4xh200-batch-64-20260424-013732 is an 8-billion-parameter, Qwen3-based language model fine-tuned with Direct Preference Optimization (DPO) on the Anthropic/hh-rlhf dataset. Optimized for helpfulness and built on a supervised fine-tuned base, it supports a 32K context length and targets general-purpose conversational AI where helpful, aligned responses are critical.
Model Overview
jackf857/qwen3-8b-base-beta-dpo-hh-helpful-4xh200-batch-64-20260424-013732 is an 8-billion-parameter language model based on the Qwen3 architecture. Starting from a supervised fine-tuned checkpoint, it was further trained with Direct Preference Optimization (DPO) on the Anthropic/hh-rlhf dataset, aligning its outputs with human preferences for helpful responses.
Key Characteristics
- Architecture: Qwen3-based, 8 billion parameters.
- Fine-tuning: Utilizes Direct Preference Optimization (DPO) for alignment.
- Dataset: Fine-tuned on the Anthropic/hh-rlhf dataset, emphasizing helpfulness.
- Context Length: Supports a context window of 32,768 tokens.
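For reference, a minimal loading sketch using Hugging Face Transformers (the library version pinned under Training Details below). The model id is from this card; the dtype and device placement are illustrative assumptions, not settings the card specifies:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "jackf857/qwen3-8b-base-beta-dpo-hh-helpful-4xh200-batch-64-20260424-013732"

# Load the tokenizer and the 8B model; bf16 keeps the weights at roughly 16 GB.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # assumption: any supported dtype works
    device_map="auto",           # assumption: requires accelerate; spreads across GPUs
)
```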
Training Details
The model was trained for a single epoch with a learning rate of 5e-07 and a total batch size of 64 across 4 devices. Training reached a final loss of 0.6505 and a mean DPO reward gap (reported as "Beta Dpo/gap Mean") of 25.7183, indicating that the policy learned to separate chosen from rejected responses. Training used Transformers 4.51.0, PyTorch 2.3.1+cu121, Datasets 2.21.0, and Tokenizers 0.21.4.
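For context, here is a minimal sketch of the DPO objective these metrics come from. This is illustrative PyTorch, not the actual training code, and the `beta` default is an assumption (the card does not state its value):

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss from per-sequence log-probs under the policy and a frozen reference."""
    # Implicit rewards are beta-scaled log-ratios against the reference model.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # The chosen-minus-rejected margin is the "gap" tracked during training.
    gap = chosen_rewards - rejected_rewards
    loss = -F.logsigmoid(gap).mean()
    return loss, gap.mean()
```

A growing mean gap means the policy assigns increasingly higher implicit reward to the chosen (helpful) responses than to the rejected ones, which is what the reported gap metric reflects.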
Intended Use Cases
This model is suited to applications requiring a helpful, aligned conversational AI, particularly where responses must adhere to human preferences for assistance and utility.
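A hedged generation example for such conversational use, reusing the `model` and `tokenizer` from the loading sketch above. It assumes the tokenizer ships a chat template; if it does not, format the prompt manually:

```python
messages = [
    {"role": "user", "content": "How do I politely decline a meeting invitation?"}
]

# Render the conversation with the tokenizer's chat template, if one is defined.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Sample a response; the prompt plus output must fit the 32,768-token window.
output = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```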