Name: jackf857/qwen3-8b-base-new-dpo-hh-helpful-4xh200-batch-64-q_t-0.45-s_star-0.6 API
Brand: Featherless.ai
Price: 10.00 USD
Availability: InStock
Author: jackf857

Model Overview

jackf857/qwen3-8b-base-new-dpo-hh-helpful-4xh200-batch-64-q_t-0.45-s_star-0.6 is an 8-billion-parameter language model developed by jackf857. It is a fine-tuned version of jackf857/qwen3-8b-base-sft-hh-helpful-4xh200-batch-64-20260417-214452.

Fine-tuning Details

The model was fine-tuned with Direct Preference Optimization (DPO) on the Anthropic/hh-rlhf dataset. DPO trains on pairs of chosen and rejected responses so that the model's outputs align with human preferences, in this case for helpfulness.
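
To make the objective concrete, here is a minimal sketch of the standard DPO loss for a single preference pair. It assumes the textbook formulation; the model name and metric names suggest this run may use a modified variant, and the source does not describe it. The function name, argument names, and example values are illustrative only.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta):
    """Standard DPO loss for one (chosen, rejected) pair.

    Inputs are summed sequence log-probabilities under the trained
    policy and under the frozen reference model (assumed here to be
    the SFT checkpoint). This is a sketch, not the author's code.
    """
    # Implicit reward: how far the policy moved each response's
    # log-probability relative to the reference model.
    chosen_reward = policy_chosen_logp - ref_chosen_logp
    rejected_reward = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_reward - rejected_reward)
    # -log(sigmoid(x)) == log(1 + exp(-x)), computed in a numerically
    # stable form so large |x| does not overflow.
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))


# Hypothetical values: the policy raised the chosen response and
# lowered the rejected one relative to the reference.
example = dpo_loss(-5.0, -15.0, -10.0, -10.0, beta=0.1)
```

When the policy and reference agree on both responses, the loss equals ln 2; it falls below that as the policy favors the chosen response more strongly than the reference does.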

Training Hyperparameters

Key hyperparameters used during training include:

Learning Rate: 5e-07
Optimizer: AdamW_Torch with betas=(0.9, 0.999)
Batch Size: 8 per device (train and eval), with a total effective train batch size of 64 across the 4 devices with gradient accumulation
Epochs: 1
Distributed Training: Multi-GPU setup with 4 devices
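
The batch-size figures above can be checked with a little arithmetic. The sketch below assumes the batch size of 8 is per device and that the effective batch size is the product of per-device batch size, device count, and gradient accumulation steps; the accumulation step count is not stated in the source and is only inferred here. Function names are illustrative.

```python
def effective_batch_size(per_device_batch, num_devices, grad_accum_steps):
    """Samples contributing to one optimizer update."""
    return per_device_batch * num_devices * grad_accum_steps


def implied_grad_accum_steps(total_batch, per_device_batch, num_devices):
    """Infer accumulation steps from the reported totals."""
    per_forward_pass = per_device_batch * num_devices
    if total_batch % per_forward_pass != 0:
        raise ValueError("total batch is not a multiple of per-step batch")
    return total_batch // per_forward_pass


# Reported values: batch size 8, 4 devices, effective batch size 64.
steps = implied_grad_accum_steps(total_batch=64, per_device_batch=8, num_devices=4)
print(steps)  # 2
```

Under these assumptions, each optimizer update accumulates gradients over 2 forward/backward passes of 32 samples each.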

Performance Metrics

Upon evaluation, the model achieved a final validation loss of 0.5224. Other notable metrics from the DPO training include:

Fcm Dpo/beta: 0.0073
Margin Dpo/margin Mean: 73.6023
Logps/chosen: -265.1147
Logps/rejected: -332.2911

The fine-tuning aims to make the model more likely to generate responses that humans prefer, which may make it more effective in interactive assistant applications.
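
As a reading aid for the log-probability metrics listed above, the short sketch below computes the gap between the reported chosen and rejected values. It uses only the policy's log-probabilities, without subtracting a reference model, so it is not expected to match the reported margin mean, whose exact definition the source does not give. The variable names are illustrative.

```python
# Reported evaluation metrics (sequence log-probabilities under the policy).
logps_chosen = -265.1147
logps_rejected = -332.2911

# A positive gap means the policy assigns higher log-probability
# to the chosen responses than to the rejected ones.
gap = logps_chosen - logps_rejected
print(f"{gap:.4f}")  # 67.1764
```

A positive gap is the direction preference training is meant to push. Because these are totals over whole sequences, the figure also depends on response length.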
