The tanliboy/lambda-qwen2.5-14b-dpo-test is a 14.8-billion-parameter language model fine-tuned from Qwen/Qwen2.5-14B-Instruct. It supports a 131,072-token context length and has been optimized with Direct Preference Optimization (DPO) on the HuggingFaceH4/ultrafeedback_binarized dataset. The model is designed for tasks that require nuanced, preference-aligned understanding and generation, and it demonstrates improved reward metrics over its base model.
Model Overview
Built on the Qwen2.5-14B-Instruct architecture, the model has undergone further fine-tuning with Direct Preference Optimization (DPO) on the HuggingFaceH4/ultrafeedback_binarized dataset, with the aim of aligning its outputs more closely with human preferences.
Key Characteristics
- Base Model: Qwen/Qwen2.5-14B-Instruct
- Parameter Count: 14.8 billion
- Context Length: 131,072 tokens
- Optimization Method: Direct Preference Optimization (DPO)
- Training Data: HuggingFaceH4/ultrafeedback_binarized dataset
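The snippet below is a minimal loading-and-generation sketch using the Hugging Face transformers library. The prompt and generation settings are illustrative assumptions rather than values published with this model.

```python
# Minimal sketch: load the model and run one chat-style generation.
# Prompt and generation settings below are illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tanliboy/lambda-qwen2.5-14b-dpo-test"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",   # keep the checkpoint's native precision
    device_map="auto",    # shard across available GPUs
)

messages = [{"role": "user", "content": "Explain DPO in two sentences."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```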
Performance Highlights
During evaluation, the model achieved a rewards accuracy of 0.7400 and a rewards margin of 0.8984; in other words, it assigns a higher implicit reward to the preferred response than to the rejected one in 74% of evaluation pairs, with an average reward gap of roughly 0.9. Training used a learning rate of 5e-07 and a total batch size of 128 across 8 GPUs, and ran for 1 epoch.
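For illustration, the sketch below shows how a comparable DPO run could be configured with the trl library's DPOTrainer. The per-device batch size and gradient-accumulation split are assumptions chosen so that 8 GPUs reproduce the reported total batch size of 128 (8 × 4 × 4), the output_dir is a placeholder, and argument names may differ slightly across trl versions.

```python
# Hedged sketch of a comparable DPO run; not the authors' exact training script.
# Assumed split: 8 GPUs x 4 per device x 4 accumulation steps = 128 total batch size.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base_model = "Qwen/Qwen2.5-14B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(base_model, torch_dtype="auto")

# Preference pairs (prompt / chosen / rejected) used for DPO training.
train_dataset = load_dataset("HuggingFaceH4/ultrafeedback_binarized", split="train_prefs")

config = DPOConfig(
    output_dir="lambda-qwen2.5-14b-dpo",  # placeholder path
    learning_rate=5e-7,                   # reported learning rate
    num_train_epochs=1,                   # reported epoch count
    per_device_train_batch_size=4,        # assumed per-GPU share of the 128 batch
    gradient_accumulation_steps=4,        # assumed accumulation to reach 128
)

trainer = DPOTrainer(
    model=model,                # the policy being optimized
    args=config,
    train_dataset=train_dataset,
    processing_class=tokenizer, # tokenizer=... in older trl versions
)
trainer.train()
```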
Intended Use Cases
This model is particularly well-suited to applications where alignment with human feedback and high-quality, preference-aligned text generation are crucial. Its DPO fine-tuning makes it a strong candidate for tasks requiring nuanced response generation and improved conversational quality.