taketakedaiki/qwen3-4b-v2-exp26-dpo
The taketakedaiki/qwen3-4b-v2-exp26-dpo is a 4 billion parameter language model, fine-tuned using Direct Preference Optimization (DPO) from the Exp25 SFT base model. It features a 32768 token context length and utilizes LoRA with specific hyperparameters (r=8, alpha=16) for efficient fine-tuning. This model is designed for tasks benefiting from preference-based alignment, building upon its supervised fine-tuned predecessor.
Loading preview...
Model Overview
The taketakedaiki/qwen3-4b-v2-exp26-dpo is a 4 billion parameter language model developed by taketakedaiki. It is a DPO (Direct Preference Optimization) fine-tuned variant, building upon the previously supervised fine-tuned (SFT) taketakedaiki/qwen3-4b-v2-exp25 base model. This model is designed to align its outputs more closely with human preferences through its DPO training.
Key Characteristics
- Base Model: Fine-tuned from
taketakedaiki/qwen3-4b-v2-exp25(Exp25 SFT). - Fine-tuning Method: Utilizes Direct Preference Optimization (DPO) for alignment.
- Training Parameters: The DPO process involved a learning rate of
1e-7, a beta value of0.1, and was conducted for1 epoch. - LoRA Configuration: Employs Low-Rank Adaptation (LoRA) with
r=8andalpha=16for efficient parameter-efficient fine-tuning. - Context Length: Supports a substantial context window of 32768 tokens.
Potential Use Cases
This model is suitable for applications where preference-aligned responses are crucial, leveraging the DPO fine-tuning to generate outputs that are preferred over those from a purely supervised fine-tuned model. It can be considered for tasks requiring nuanced understanding and generation based on implicit or explicit preference data.