jackf857/qwen3-8b-base-new-dpo-hh-harmless-4xh200-batch-64-q_t-0.45-s_star-0.85

Text Generation · Model size: 8B · Quant: FP8 · Ctx length: 32k · Published: Apr 26, 2026 · License: apache-2.0 · Architecture: Transformer · Open weights · Concurrency cost: 1

This is an 8-billion-parameter language model fine-tuned by jackf857 with Direct Preference Optimization (DPO) on the Anthropic/hh-rlhf dataset. Built on Qwen3-8B-base, it is optimized for harmlessness and alignment with human preferences, and is intended for applications that require safe, preference-aligned text generation.


Model Overview

This model, developed by jackf857, is an 8-billion-parameter language model fine-tuned using Direct Preference Optimization (DPO). It builds on a Qwen3-8B-base checkpoint that was first supervised fine-tuned (SFT) for harmlessness (jackf857/qwen3-8b-base-sft-hh-harmless-4xh200-batch-64-20260417-214452). The DPO stage then used the Anthropic/hh-rlhf dataset to further align the model with human preferences, with a particular focus on generating harmless responses.

Key Characteristics

  • Base Model: Qwen3-8B-base architecture.
  • Fine-tuning Method: Direct Preference Optimization (DPO).
  • Dataset: Anthropic/hh-rlhf, emphasizing harmlessness.
  • Parameter Count: 8 billion parameters.
  • Context Length: 32768 tokens.
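A minimal loading sketch with Hugging Face transformers, assuming the checkpoint is published under the repo id in the title and loads through the standard auto classes (the FP8 quantization noted above needs compatible hardware and kernels; bfloat16 is shown here as a portable fallback):

```python
# Minimal loading sketch; the repo id and loading path are assumptions
# based on this card, not verified against the published weights.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "jackf857/qwen3-8b-base-new-dpo-hh-harmless-4xh200-batch-64-q_t-0.45-s_star-0.85"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # portable fallback; FP8 requires suitable kernel support
    device_map="auto",
)
```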

Training Details

The model underwent a single epoch of DPO training with a learning rate of 5e-07, a total batch size of 64, and a cosine learning rate scheduler. Evaluation metrics, including the DPO loss and the reward margin, track how consistently the model ranks chosen responses above rejected ones.
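For reference, DPO minimizes

$$\mathcal{L}_{\text{DPO}} = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} \;-\; \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right],$$

where the term inside the sigmoid is the reported margin. Below is a hedged sketch of how the stated hyperparameters could map onto TRL's DPOTrainer; the beta value, the per-device batch split across the 4×H200 setup, and any dataset preprocessing are assumptions, since the author's actual training script is not published here.

```python
# Hedged sketch: maps the stated hyperparameters (1 epoch, lr 5e-07,
# total batch size 64, cosine schedule) onto TRL's DPOTrainer.
# beta and the per-device/accumulation split are assumptions.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

sft_model_id = "jackf857/qwen3-8b-base-sft-hh-harmless-4xh200-batch-64-20260417-214452"
tokenizer = AutoTokenizer.from_pretrained(sft_model_id)
model = AutoModelForCausalLM.from_pretrained(sft_model_id)

# Anthropic/hh-rlhf provides paired "chosen"/"rejected" conversations;
# recent TRL versions can extract the shared prompt prefix automatically.
train_dataset = load_dataset("Anthropic/hh-rlhf", split="train")

config = DPOConfig(
    output_dir="qwen3-8b-dpo-hh-harmless",
    num_train_epochs=1,              # stated: single epoch
    learning_rate=5e-7,              # stated: 5e-07
    lr_scheduler_type="cosine",      # stated: cosine scheduler
    per_device_train_batch_size=4,   # assumption: 4 per GPU x 4 GPUs x 4 accum = 64 total
    gradient_accumulation_steps=4,
    beta=0.1,                        # assumption: TRL's default DPO beta
)

trainer = DPOTrainer(
    model=model,
    args=config,
    train_dataset=train_dataset,
    processing_class=tokenizer,
)
trainer.train()
```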

Intended Use Cases

This model is particularly suited for applications where generating safe, non-toxic, and preference-aligned text is crucial. Its DPO fine-tuning on a harmlessness dataset makes it a strong candidate for conversational AI, content moderation, and other scenarios requiring robust safety guardrails.
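Continuing from the loading sketch above, a short generation example; the `\n\nHuman:` / `\n\nAssistant:` framing follows the hh-rlhf turn format and is an assumption based on the training data rather than a documented chat template:

```python
# Hedged generation sketch, reusing `tokenizer` and `model` from the
# loading example; the hh-rlhf-style prompt framing is an assumption.
prompt = "\n\nHuman: How do I politely decline a meeting invitation?\n\nAssistant:"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)
# Decode only the newly generated tokens.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```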