yunjae-won/ubq30i_qwen4b_dpo_topk20_j0
The yunjae-won/ubq30i_qwen4b_dpo_topk20_j0 model is a 4 billion parameter language model fine-tuned from yunjae-won/ubq30i_qwen4b_sft_both using Direct Preference Optimization (DPO). DPO aligns the model's outputs with human preferences, and the model inherits its base's 32,768-token context length. It is designed to generate text that human evaluators prefer, making it suitable for conversational AI and response generation tasks.
Model Overview
The yunjae-won/ubq30i_qwen4b_dpo_topk20_j0 is a 4 billion parameter language model developed by yunjae-won. It is a fine-tuned variant of the yunjae-won/ubq30i_qwen4b_sft_both model, optimized using Direct Preference Optimization (DPO). DPO is a training method that aligns language model outputs with human preferences by treating preference data as an implicit reward signal, as detailed in the paper "Direct Preference Optimization: Your Language Model is Secretly a Reward Model" (https://arxiv.org/abs/2305.18290).
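Concretely, DPO minimizes the following objective over preference pairs, where $\pi_\theta$ is the policy being trained, $\pi_{\mathrm{ref}}$ is a frozen reference model (here, presumably the SFT base yunjae-won/ubq30i_qwen4b_sft_both), and $\beta$ controls how far the policy may drift from the reference:

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]
$$

Here $y_w$ and $y_l$ are the preferred and dispreferred responses for prompt $x$, and $\sigma$ is the logistic function; the difference of log-ratios acts as the implicit reward margin.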
Key Characteristics
- Architecture: Based on the Qwen 4B family, with 4 billion parameters.
- Training Method: Direct Preference Optimization (DPO) for preference alignment.
- Context Length: Supports a 32,768-token context window.
- Framework: Trained with the TRL (Transformer Reinforcement Learning) library; a training sketch follows this list.
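The card does not publish the exact training configuration, but a DPO run with TRL generally follows the pattern below. This is a minimal sketch: the preference dataset, hyperparameters, and output path are illustrative placeholders, not the values used to train this model.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

# Start from the SFT checkpoint that this model was fine-tuned from.
model_name = "yunjae-won/ubq30i_qwen4b_sft_both"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Any preference dataset with "prompt", "chosen", and "rejected" columns works;
# this public dataset is a stand-in, not the one used for this model.
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

training_args = DPOConfig(
    output_dir="qwen4b-dpo",
    beta=0.1,  # strength of the KL penalty toward the frozen reference model
)

# With ref_model omitted, TRL clones the initial policy as the frozen reference.
trainer = DPOTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    processing_class=tokenizer,  # named `tokenizer=` in TRL versions before 0.12
)
trainer.train()
```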
Intended Use Cases
This model is particularly well-suited for applications requiring the following (a quick-start inference example appears after the list):
- Preference-aligned text generation: Generating responses that are more likely to be preferred by users.
- Conversational AI: Enhancing the quality and naturalness of dialogue systems.
- Instruction following: Producing outputs that better adhere to given instructions due to DPO fine-tuning.
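To try the model, the standard transformers chat workflow applies. This sketch assumes the checkpoint ships a chat template like its Qwen base; the prompt is only an illustration.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "yunjae-won/ubq30i_qwen4b_dpo_topk20_j0"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"  # device_map requires `accelerate`
)

messages = [{"role": "user", "content": "Summarize what DPO fine-tuning changes in two sentences."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=256)
# Decode only the newly generated tokens, dropping the prompt.
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```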