taketakedaiki/qwen3-4b-v2-exp26-dpo

Hugging Face
TEXT GENERATIONConcurrency Cost:1Model Size:4BQuant:BF16Ctx Length:32kPublished:Mar 1, 2026License:apache-2.0Architecture:Transformer Open Weights Warm

The taketakedaiki/qwen3-4b-v2-exp26-dpo is a 4 billion parameter language model, fine-tuned using Direct Preference Optimization (DPO) from the Exp25 SFT base model. It features a 32768 token context length and utilizes LoRA with specific hyperparameters (r=8, alpha=16) for efficient fine-tuning. This model is designed for tasks benefiting from preference-based alignment, building upon its supervised fine-tuned predecessor.

Loading preview...

Model Overview

The taketakedaiki/qwen3-4b-v2-exp26-dpo is a 4 billion parameter language model developed by taketakedaiki. It is a DPO (Direct Preference Optimization) fine-tuned variant, building upon the previously supervised fine-tuned (SFT) taketakedaiki/qwen3-4b-v2-exp25 base model. This model is designed to align its outputs more closely with human preferences through its DPO training.

Key Characteristics

  • Base Model: Fine-tuned from taketakedaiki/qwen3-4b-v2-exp25 (Exp25 SFT).
  • Fine-tuning Method: Utilizes Direct Preference Optimization (DPO) for alignment.
  • Training Parameters: The DPO process involved a learning rate of 1e-7, a beta value of 0.1, and was conducted for 1 epoch.
  • LoRA Configuration: Employs Low-Rank Adaptation (LoRA) with r=8 and alpha=16 for efficient parameter-efficient fine-tuning.
  • Context Length: Supports a substantial context window of 32768 tokens.

Potential Use Cases

This model is suitable for applications where preference-aligned responses are crucial, leveraging the DPO fine-tuning to generate outputs that are preferred over those from a purely supervised fine-tuned model. It can be considered for tasks requiring nuanced understanding and generation based on implicit or explicit preference data.