jackf857/qwen3-8b-base-new-dpo-hh-helpful-4xh200-batch-64-q_t-0.45-s_star-0.85
This is an 8 billion parameter Qwen3-based language model, fine-tuned by jackf857 using Direct Preference Optimization (DPO) on the Anthropic/hh-rlhf dataset. It is optimized for helpfulness and safety, building upon a supervised fine-tuned base model. The model has a context length of 32768 tokens and is designed for general conversational AI applications where helpful and harmless responses are prioritized.
Model Overview
This model, qwen3-8b-base-new-dpo-hh-helpful-4xh200-batch-64-q_t-0.45-s_star-0.85, is an 8 billion parameter language model based on the Qwen3 architecture. It was developed by jackf857 and represents a further fine-tuning of a previously supervised fine-tuned (SFT) model, specifically jackf857/qwen3-8b-base-sft-hh-helpful-4xh200-batch-64-20260417-214452.
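A minimal usage sketch with the Hugging Face `transformers` library. This assumes the checkpoint follows the standard Qwen3 causal-LM layout; the prompt and generation settings are illustrative, not recommendations from the model author:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "jackf857/qwen3-8b-base-new-dpo-hh-helpful-4xh200-batch-64-q_t-0.45-s_star-0.85"

# Load the tokenizer and weights (the 8B checkpoint is large; "auto" picks
# the stored dtype and places layers on available devices)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

prompt = "How do I politely decline a meeting invitation?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Sample a response; max_new_tokens and temperature are illustrative choices
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```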
Key Capabilities
- Direct Preference Optimization (DPO): The model has undergone DPO training, a method that aligns language models with human preferences directly from preference pairs, without training a separate reward model, and is widely used to improve helpfulness and harmlessness.
- Anthropic/hh-rlhf Dataset: Fine-tuned on the Anthropic/hh-rlhf dataset, a collection of human preference pairs designed to steer models toward responses that are both helpful and harmless.
- Performance Metrics: Achieved a final loss of 0.4843 on the evaluation set, with DPO reward margins (chosen minus rejected rewards) indicating successful preference learning.
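The DPO objective behind the metrics above can be sketched in a few lines. This is a generic, single-pair version of the standard DPO loss, not code from this repository; `beta` is the usual temperature hyperparameter, and the inputs are summed log-probabilities of each response under the policy and the frozen reference (SFT) model:

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair.

    Each argument is the summed log-probability of the chosen/rejected
    response under the policy (pi_*) or the reference model (ref_*).
    """
    # Implicit reward margin: how much more the policy prefers "chosen"
    # over "rejected", relative to the reference model
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    # -log sigmoid(margin): minimized by pushing the margin positive
    loss = -math.log(1.0 / (1.0 + math.exp(-margin)))
    return loss, margin

# When the policy matches the reference exactly, the margin is 0
# and the loss equals log(2) ≈ 0.693
loss, margin = dpo_loss(-10.0, -12.0, -10.0, -12.0)
```

A positive reward margin on the evaluation set is exactly the "DPO margin metric" a model card like this one reports: it means the trained policy assigns relatively more probability to preferred responses than the reference model does.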
Training Details
The model was trained for 1 epoch with a learning rate of 5e-07 and a total batch size of 64 across 4 GPUs, using the AdamW optimizer with a cosine learning rate scheduler and a warmup ratio of 0.1. Training ran for 600 steps, with validation loss and DPO reward metrics improving consistently throughout.
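The hyperparameters above map naturally onto a TRL `DPOConfig`. The following is a hedged sketch, not the author's actual training script: the per-device batch size and gradient-accumulation split of the total batch size of 64, and the DPO `beta`, are assumptions not stated in the card.

```python
from trl import DPOConfig

# Reconstructed from the reported hyperparameters; the 64-sample global
# batch is assumed to be 16 per device x 4 GPUs with no accumulation
config = DPOConfig(
    output_dir="qwen3-8b-dpo-hh-helpful",
    num_train_epochs=1,
    learning_rate=5e-7,
    per_device_train_batch_size=16,   # assumption: 16 x 4 GPUs = 64
    gradient_accumulation_steps=1,    # assumption
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    beta=0.1,                         # assumption: TRL's default DPO beta
)
```

This config would then be passed to TRL's `DPOTrainer` together with the SFT checkpoint (jackf857/qwen3-8b-base-sft-hh-helpful-4xh200-batch-64-20260417-214452) as the starting policy and reference model.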