Wiihuyng/qwen-0.5b-dpo-humanlike
Wiihuyng/qwen-0.5b-dpo-humanlike is a 0.5 billion parameter causal language model, fine-tuned by Wiihuyng using Direct Preference Optimization (DPO) on a base model from the Qwen family. This model specializes in generating human-like responses, building upon its supervised fine-tuned predecessor. With a context length of 32768 tokens, it is designed for conversational AI and tasks requiring nuanced, preference-aligned text generation.
Loading preview...
Model Overview
Wiihuyng/qwen-0.5b-dpo-humanlike is a 0.5 billion parameter language model developed by Wiihuyng, building upon the Wiihuyng/qwen-0.5b-sft-humanlike base model. This iteration has been further fine-tuned using Direct Preference Optimization (DPO), a method designed to align language models with human preferences without the need for a separate reward model. The training was conducted using the TRL (Transformers Reinforcement Learning) framework.
Key Capabilities
- Human-like Response Generation: Optimized to produce outputs that align closely with human preferences and conversational styles.
- Preference Alignment: Leverages DPO for effective alignment, making it suitable for interactive applications where user satisfaction is key.
- Efficient Size: At 0.5 billion parameters, it offers a balance between performance and computational efficiency.
- Extended Context Window: Supports a context length of 32768 tokens, allowing for more extensive and coherent interactions.
Training Details
The model's training procedure utilized DPO, as introduced in the paper "Direct Preference Optimization: Your Language Model is Secretly a Reward Model" (paper link). This method directly optimizes a policy to maximize the likelihood of preferred responses over dispreferred ones, based on a dataset of human preferences. The training environment included TRL 1.5.1, Transformers 5.9.0, Pytorch 2.10.0+cu128, Datasets 4.8.5, and Tokenizers 0.22.2.