jackf857/qwen3-8b-base-epsilon-dpo-hh-helpful-4xh200-batch-64-20260424-040306
jackf857/qwen3-8b-base-epsilon-dpo-hh-helpful-4xh200-batch-64-20260424-040306 is an 8-billion-parameter language model fine-tuned from jackf857/qwen3-8b-base-sft-hh-helpful-4xh200-batch-64-20260417-214452. It was optimized with Direct Preference Optimization (DPO) on the Anthropic/hh-rlhf dataset, reaching a rewards accuracy of 0.7568. Building on its SFT base, the model is intended for generating helpful, preference-aligned responses.
Model Overview
This model is an 8-billion-parameter language model, fine-tuned from jackf857/qwen3-8b-base-sft-hh-helpful-4xh200-batch-64-20260417-214452 using Direct Preference Optimization (DPO).
Key Differentiators
- Fine-tuned for Helpfulness: The model underwent DPO training on the Anthropic/hh-rlhf dataset, a collection of human preference comparisons between paired responses, used to align models toward helpfulness and harmlessness.
- Performance Metrics: During evaluation, it achieved a rewards accuracy of 0.7568, with a chosen reward of -0.6029 and a rejected reward of -0.8720; in other words, the model assigns a higher implicit reward to the preferred response in roughly 76% of preference pairs (see the sketch after this list).
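For reference, DPO's rewards accuracy is the fraction of preference pairs for which the policy's implicit reward for the chosen response exceeds that for the rejected one, where the implicit reward is beta times the log-probability ratio between the policy and the frozen reference model. Below is a minimal sketch of that computation, assuming per-sequence log-probabilities are already gathered; the function names and the beta=0.1 default are illustrative, not taken from this training run.

```python
import torch

def implicit_reward(policy_logps: torch.Tensor, ref_logps: torch.Tensor, beta: float = 0.1) -> torch.Tensor:
    # DPO implicit reward: beta * (log pi_theta(y|x) - log pi_ref(y|x)).
    # beta=0.1 is a common default, not confirmed for this run.
    return beta * (policy_logps - ref_logps)

def rewards_accuracy(policy_chosen, ref_chosen, policy_rejected, ref_rejected, beta: float = 0.1) -> float:
    # Fraction of pairs where the chosen response earns the higher implicit reward.
    chosen = implicit_reward(policy_chosen, ref_chosen, beta)
    rejected = implicit_reward(policy_rejected, ref_rejected, beta)
    return (chosen > rejected).float().mean().item()
```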
Training Details
- Optimization Method: Utilizes Direct Preference Optimization (DPO) for alignment.
- Hyperparameters: Trained with a learning rate of 5e-07, a total batch size of 64, and a cosine learning-rate scheduler over 1 epoch (a configuration sketch follows this list).
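These hyperparameters map directly onto TRL's DPOTrainer. The snippet below is a minimal sketch of what such a run could look like, not the actual training script: the per-device batch split across the 4 H200 GPUs is an assumption (4 GPUs x 4 per device x 4 accumulation steps = 64), and the argument names follow recent TRL releases (older versions take tokenizer= instead of processing_class=).

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base = "jackf857/qwen3-8b-base-sft-hh-helpful-4xh200-batch-64-20260417-214452"
model = AutoModelForCausalLM.from_pretrained(base)
tokenizer = AutoTokenizer.from_pretrained(base)

# Hyperparameters from this model card; the 4-GPU batch split is illustrative.
config = DPOConfig(
    output_dir="qwen3-8b-epsilon-dpo-hh-helpful",
    learning_rate=5e-7,
    per_device_train_batch_size=4,   # 4 GPUs x 4 x 4 grad-accum steps = 64 total
    gradient_accumulation_steps=4,
    num_train_epochs=1,
    lr_scheduler_type="cosine",
)

# hh-rlhf stores full dialogues in "chosen"/"rejected" columns; recent TRL extracts
# the shared prompt automatically, older versions need explicit prompt/chosen/rejected.
train_dataset = load_dataset("Anthropic/hh-rlhf", split="train")

trainer = DPOTrainer(
    model=model,                 # a frozen reference copy is created internally
    args=config,
    train_dataset=train_dataset,
    processing_class=tokenizer,
)
trainer.train()
```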
Intended Use Cases
This model is suited to applications that require helpful, preference-aligned text generation, particularly where agreement with human judgments matters. Because it was preference-tuned on hh-rlhf, its responses should be more likely to be rated helpful and harmless by human evaluators than those of its SFT base; a minimal inference sketch follows.
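The following is a minimal inference sketch using Transformers. The hh-rlhf-style prompt format ("\n\nHuman: ...\n\nAssistant:") and the sampling settings are assumptions, not documented for this model; check the tokenizer for a chat template before relying on them, and note that device_map="auto" requires the accelerate package.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "jackf857/qwen3-8b-base-epsilon-dpo-hh-helpful-4xh200-batch-64-20260424-040306"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype="auto")

# hh-rlhf-style dialogue prompt; an assumption, not confirmed by the model card.
prompt = "\n\nHuman: How do I write a polite follow-up email?\n\nAssistant:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7)

# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```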