jackf857/qwen3-8b-base-epsilon-dpo-hh-harmless-4xh200-batch-64-20260424-040415

Text Generation · Concurrency Cost: 1 · Model Size: 8B · Quant: FP8 · Ctx Length: 32k · Published: Apr 24, 2026 · Architecture: Transformer

The jackf857/qwen3-8b-base-epsilon-dpo-hh-harmless-4xh200-batch-64-20260424-040415 model is an 8 billion parameter language model, fine-tuned from jackf857/qwen3-8b-base-sft-hh-harmless-4xh200-batch-64-20260417-214452 using Direct Preference Optimization (DPO) on the Anthropic/hh-rlhf dataset. The model is optimized for harmlessness and alignment with human preferences, reaching a rewards accuracy of 0.7328 on its evaluation set. It is designed for applications requiring safe, preference-aligned text generation within a 32768-token context window.


Overview

This model, jackf857/qwen3-8b-base-epsilon-dpo-hh-harmless-4xh200-batch-64-20260424-040415, is an 8 billion parameter language model developed by jackf857. It is a fine-tuned variant of jackf857/qwen3-8b-base-sft-hh-harmless-4xh200-batch-64-20260417-214452, specifically enhanced through Direct Preference Optimization (DPO).
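Below is a minimal generation sketch using Hugging Face transformers. It assumes the checkpoint is hosted on the Hub under the repo id above and that prompts follow the hh-rlhf Human/Assistant dialogue format, which the SFT stage presumably saw; neither detail is confirmed by this card, so adjust as needed.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed Hub repo id, copied from the model name above; adjust if the
# checkpoint lives elsewhere.
model_id = "jackf857/qwen3-8b-base-epsilon-dpo-hh-harmless-4xh200-batch-64-20260424-040415"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

# hh-rlhf-style dialogue format; assumed to match the SFT training format.
prompt = "\n\nHuman: Explain briefly why seatbelts matter.\n\nAssistant:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Prompt plus completion must fit within the 32768-token context window.
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)

print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:],
                       skip_special_tokens=True))
```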

Key Capabilities

  • Preference Alignment: Fine-tuned on the Anthropic/hh-rlhf dataset, indicating a strong focus on aligning with human preferences and generating harmless outputs.
  • DPO Training: Utilizes Direct Preference Optimization, which fine-tunes directly on preference pairs and avoids the separate reward model and reinforcement-learning loop of classic RLHF (a minimal loss sketch follows this list).
  • Performance Metrics: Achieved a rewards accuracy of 0.7328 and a rewards margin of 0.3976 on its evaluation set, indicating it reliably assigns higher implicit reward to preferred responses.
  • Context Window: Supports a context length of 32768 tokens, suitable for processing and generating longer sequences of text.
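For reference, here is a compact sketch of the DPO objective and how the two reported metrics fall out of it. The metric names match those logged by common DPO trainers; the function below is illustrative, and beta is an assumed hyperparameter (the actual training value is not stated on this card).

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss over a batch of (chosen, rejected) completion pairs.

    Each argument is a 1-D tensor of summed per-token log-probabilities
    for a completion under the policy or the frozen reference (SFT) model.
    beta=0.1 is a common default, assumed here for illustration.
    """
    # Implicit rewards: beta-scaled log-ratio of policy to reference.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)

    # DPO loss: -log sigmoid(reward margin), averaged over the batch.
    loss = -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

    # Reported metrics: rewards accuracy is the fraction of pairs where
    # the chosen completion's implicit reward wins; rewards margin is the
    # mean gap between chosen and rejected rewards.
    accuracy = (chosen_rewards > rejected_rewards).float().mean()
    margin = (chosen_rewards - rejected_rewards).mean()
    return loss, accuracy, margin
```

On this model's evaluation set, the accuracy term above came out to 0.7328 and the margin to 0.3976.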

Good For

  • Applications requiring models that prioritize harmlessness and safety in their responses.
  • Use cases where human preference alignment is critical, such as chatbots, content moderation, or interactive AI systems.
  • Developers looking for a model fine-tuned with DPO for improved behavioral characteristics.