Name: Wiihuyng/qwen-0.5b-dpo-humanlike API
Brand: Featherless.ai
Price: 10.00 USD
Availability: InStock
Author: Wiihuyng

Model Overview

Wiihuyng/qwen-0.5b-dpo-humanlike is a 0.5 billion parameter language model developed by Wiihuyng, built on the Wiihuyng/qwen-0.5b-sft-humanlike base model. This iteration has been further fine-tuned with Direct Preference Optimization (DPO), a method that aligns language models with human preferences without requiring a separate reward model. Training was conducted with the TRL (Transformer Reinforcement Learning) library.

Key Capabilities

Human-like Response Generation: Optimized to produce outputs that align closely with human preferences and conversational styles.
Preference Alignment: Leverages DPO for effective alignment, making it suitable for interactive applications where user satisfaction is key.
Efficient Size: At 0.5 billion parameters, it offers a balance between performance and computational efficiency.
Extended Context Window: Supports a context length of 32768 tokens, allowing for more extensive and coherent interactions.

Training Details

The model was trained with DPO, as introduced in the paper "Direct Preference Optimization: Your Language Model is Secretly a Reward Model". This method directly optimizes the policy to raise the likelihood of preferred responses relative to dispreferred ones, using a dataset of human preferences. The training environment included TRL 1.5.1, Transformers 5.9.0, PyTorch 2.10.0+cu128, Datasets 4.8.5, and Tokenizers 0.22.2.
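
To make the objective described above concrete, the following is a minimal sketch of the per-example DPO loss from the paper, written in plain Python. It is an illustration only, not the training code used for this model: the function name dpo_loss, its argument names, and the beta value of 0.1 are assumptions chosen for the example, and the inputs stand for summed log-probabilities of each full response under the trained policy and under the frozen reference model.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss: -log(sigmoid(beta * margin)).

    All names and the default beta are hypothetical; the model card
    does not state the hyperparameters that were actually used.
    """
    # How much more (in log space) the policy likes each response
    # than the frozen reference model does.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp

    # Positive margin: the policy favors the preferred response
    # more strongly than the reference does.
    margin = beta * (chosen_logratio - rejected_logratio)

    # -log(sigmoid(m)) == log(1 + exp(-m)), computed in a
    # numerically stable way.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


if __name__ == "__main__":
    # Policy identical to the reference: margin is 0, loss is log(2).
    print(dpo_loss(-1.0, -2.0, -1.0, -2.0))
    # Policy shifted toward the preferred response: loss is lower.
    print(dpo_loss(-0.5, -3.0, -1.0, -2.0))
```

When the policy and the reference assign the same log-probabilities, the margin is zero and the loss equals log(2); the loss falls as the policy raises the preferred response's likelihood relative to the dispreferred one, which is why no separate reward model is needed.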
