jackf857/qwen3-8b-base-beta-dpo-hh-harmless-4xh200-batch-64

TEXT GENERATION · Concurrency Cost: 1 · Model Size: 8B · Quant: FP8 · Ctx Length: 32k · Published: Apr 20, 2026 · Architecture: Transformer · Cold

jackf857/qwen3-8b-base-beta-dpo-hh-harmless-4xh200-batch-64 is an 8-billion-parameter language model fine-tuned from a Qwen3-8B-base variant on the Anthropic/hh-rlhf dataset. It has a context length of 32768 tokens and is optimized for harmlessness and alignment through Beta DPO training, making it suited to applications that require safe, well-aligned conversational outputs.


Model Overview

This model, jackf857/qwen3-8b-base-beta-dpo-hh-harmless-4xh200-batch-64, is an 8 billion parameter language model derived from a Qwen3-8B-base architecture. It has been fine-tuned using the Anthropic/hh-rlhf dataset with a Beta DPO (Direct Preference Optimization) training procedure, specifically targeting improved harmlessness and alignment.

Key Characteristics

  • Base Model: Fine-tuned from /scratch/qu.yang1/dynamic-dpo-v4/outputs/qwen3-8b-base-sft-hh-helpful-4xh200-batch-64-20260417-214452.
  • Training Objective: Optimized for harmlessness and alignment using the Anthropic/hh-rlhf dataset.
  • Training Method: Utilizes Beta DPO (a DPO variant with a tuned beta coefficient), reaching a final beta value of 0.1628 and a loss of 0.6518 on the evaluation set.
  • Context Length: Supports a context window of 32768 tokens.
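To make the training objective above concrete, the sketch below implements the standard DPO loss for a single preference pair, plugging in the reported final beta of 0.1628. This is an illustrative reconstruction of the DPO formula, not the model's actual training code; the function name and example log-probabilities are invented for the demo.

```python
import math

def beta_dpo_loss(logp_chosen, logp_rejected,
                  ref_logp_chosen, ref_logp_rejected, beta=0.1628):
    """DPO loss for one preference pair.

    beta scales the implicit reward margin between the policy and the
    frozen reference model; 0.1628 is the final beta reported for this
    model's Beta DPO run (used here only as a default).
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)), written via log1p for numerical stability
    return math.log1p(math.exp(-margin))

# When the policy has not moved from the reference, the margin is zero
# and the loss is ln(2) ~= 0.693; preferring the chosen response more
# strongly than the reference does drives the loss below that baseline.
baseline = beta_dpo_loss(-10.0, -12.0, -10.0, -12.0)
improved = beta_dpo_loss(-9.0, -13.0, -10.0, -12.0)
```

Minimizing this loss pushes the policy to assign a larger relative log-probability to the harmless ("chosen") response than the reference model does, which is how the hh-rlhf preferences shape the model's behavior.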

Intended Use Cases

This model is particularly suited for applications where generating safe, non-toxic, and aligned responses is critical. Its fine-tuning on the Anthropic/hh-rlhf dataset suggests a strong focus on reducing harmful outputs, making it a candidate for:

  • Safe conversational agents: Developing chatbots or virtual assistants that prioritize harmless interactions.
  • Content moderation: Assisting in filtering or generating content that adheres to safety guidelines.
  • Research in alignment: Exploring the effects of DPO on model behavior and safety.
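For readers exploring DPO on hh-rlhf themselves, the sketch below shows one common way to split an hh-rlhf example into the (prompt, chosen, rejected) triple that DPO training consumes. The "chosen"/"rejected" field names follow the public hh-rlhf schema; the splitting heuristic and sample dialogue are assumptions for illustration, not this model's actual preprocessing.

```python
def split_hh_pair(example):
    """Split an Anthropic/hh-rlhf example into (prompt, chosen, rejected).

    Each hh-rlhf example stores two full transcripts that share a prefix
    and differ only in the final assistant reply, so splitting on the
    last "Assistant:" turn recovers the shared prompt.
    """
    marker = "\n\nAssistant:"
    prompt, chosen = example["chosen"].rsplit(marker, 1)
    _, rejected = example["rejected"].rsplit(marker, 1)
    return prompt + marker, chosen.strip(), rejected.strip()

# Hypothetical example in the hh-rlhf transcript format.
sample = {
    "chosen": ("\n\nHuman: How do I stay safe online?"
               "\n\nAssistant: Use strong, unique passwords."),
    "rejected": ("\n\nHuman: How do I stay safe online?"
                 "\n\nAssistant: Just click everything."),
}
prompt, good, bad = split_hh_pair(sample)
```

The resulting triple is what a DPO trainer scores under both the policy and the reference model when computing the preference loss.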