akseljoonas/Qwen3-4B-DPO

Hugging Face
Text Generation · Concurrency Cost: 1 · Model Size: 4B · Quant: BF16 · Ctx Length: 32k · Published: Jan 14, 2026 · Architecture: Transformer · Warm

akseljoonas/Qwen3-4B-DPO is a 4-billion-parameter language model, fine-tuned from Qwen/Qwen3-4B-Instruct-2507 using Direct Preference Optimization (DPO). This model is designed to align more closely with human preferences, offering improved response quality and helpfulness. It supports a 40,960-token context length, making it suitable for applications requiring nuanced and preference-aligned text generation.


Model Overview

akseljoonas/Qwen3-4B-DPO is a 4-billion-parameter language model derived from the Qwen3-4B-Instruct-2507 base model. It has been fine-tuned using Direct Preference Optimization (DPO), a method that aligns the model's outputs with human preferences by optimizing directly on preference data, treating the language model itself as an implicit reward model rather than training a separate one. This training approach aims to enhance the model's ability to generate more desirable and helpful responses.

Key Capabilities

  • Preference-Aligned Generation: Trained with DPO, the model is optimized to produce outputs that better match human preferences, leading to higher quality and more relevant text.
  • Instruction Following: Inherits strong instruction-following capabilities from its Qwen3-4B-Instruct base, making it effective for various prompt-based tasks.
  • Extended Context Window: Features a substantial 40,960-token context length, enabling it to process and generate longer, more coherent texts while maintaining context.
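A minimal generation sketch using the Transformers library is shown below. It assumes the model is available on the Hugging Face Hub under the id above and that a chat template is bundled with the tokenizer, as is typical for Qwen3 instruct models; the prompt and generation settings are illustrative.

```python
# Minimal inference sketch for akseljoonas/Qwen3-4B-DPO (assumes the model
# is downloadable from the Hugging Face Hub and fits on the local device).
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "akseljoonas/Qwen3-4B-DPO"


def generate(prompt: str, max_new_tokens: int = 256) -> str:
    """Run one chat turn through the model and return the decoded reply."""
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype="auto", device_map="auto"
    )
    # Format the user message with the tokenizer's built-in chat template.
    messages = [{"role": "user", "content": prompt}]
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    outputs = model.generate(inputs, max_new_tokens=max_new_tokens)
    # Strip the prompt tokens and decode only the newly generated text.
    return tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True)


if __name__ == "__main__":
    print(generate("Summarize the benefits of DPO in two sentences."))
```

`device_map="auto"` (which requires the `accelerate` package) places the weights on GPU when one is available; on CPU-only machines the model still loads, just more slowly.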

Training Details

The model was fine-tuned using the TRL (Transformer Reinforcement Learning) library, specifically implementing the DPO method. DPO, introduced in the paper "Direct Preference Optimization: Your Language Model is Secretly a Reward Model," is a robust and stable alternative to traditional reinforcement learning from human feedback (RLHF) for preference alignment.
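The card does not publish the exact training recipe, but a DPO run with TRL generally follows the shape below. The dataset name, `beta`, and batch size are illustrative assumptions, not the values used for this model; only the base checkpoint id comes from the card.

```python
# Hedged sketch of DPO fine-tuning with TRL's DPOTrainer. Dataset and
# hyperparameters are placeholders, not this model's actual recipe.
def train():
    # Imports are kept inside the function so the sketch can be read and
    # loaded without TRL/datasets installed.
    from datasets import load_dataset
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from trl import DPOConfig, DPOTrainer

    base = "Qwen/Qwen3-4B-Instruct-2507"  # base checkpoint named on the card
    model = AutoModelForCausalLM.from_pretrained(base)
    tokenizer = AutoTokenizer.from_pretrained(base)

    # DPO expects a preference dataset with "prompt", "chosen", and
    # "rejected" columns; this public dataset is one example of that format.
    dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

    config = DPOConfig(
        output_dir="Qwen3-4B-DPO",
        beta=0.1,  # strength of the KL-style penalty toward the reference model
        per_device_train_batch_size=1,
    )
    trainer = DPOTrainer(
        model=model,
        args=config,
        train_dataset=dataset,
        processing_class=tokenizer,
    )
    trainer.train()
```

When no explicit `ref_model` is passed, `DPOTrainer` builds the frozen reference policy from a copy of the initial model, which is what anchors the implicit reward described in the DPO paper.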

Use Cases

This model is particularly well-suited for applications where the quality and alignment of generated text with human preferences are critical. This includes tasks such as:

  • Chatbots and Conversational AI: Generating more natural and preferred responses in dialogue systems.
  • Content Creation: Producing high-quality, preference-aligned text for articles, summaries, or creative writing.
  • Instruction-based Tasks: Excelling in scenarios where clear and helpful responses to specific instructions are required.
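For the chatbot use case, a multi-turn loop can be kept very small if the model is served behind an OpenAI-compatible chat-completions endpoint (an assumption here; the URL below is a placeholder, not a documented API for this deployment):

```python
# Simple chat loop against a hypothetical OpenAI-compatible server hosting
# the model (e.g. a local vLLM or TGI instance). The URL is a placeholder.
import requests

API_URL = "http://localhost:8000/v1/chat/completions"  # placeholder endpoint


def build_payload(history, user_message, max_tokens=256):
    """Append the new user turn and build a chat-completions request body."""
    messages = history + [{"role": "user", "content": user_message}]
    return {
        "model": "akseljoonas/Qwen3-4B-DPO",
        "messages": messages,
        "max_tokens": max_tokens,
    }


def chat_turn(history, user_message):
    """Send one turn and return the history extended with both new messages."""
    resp = requests.post(API_URL, json=build_payload(history, user_message), timeout=60)
    resp.raise_for_status()
    reply = resp.json()["choices"][0]["message"]["content"]
    return history + [
        {"role": "user", "content": user_message},
        {"role": "assistant", "content": reply},
    ]
```

Keeping the full message history in each request is what lets the model use its long context window to stay coherent across many turns.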