tanliboy/lambda-qwen2.5-14b-dpo-test

  • Parameters: 14.8B
  • Precision: FP8
  • Context: 32,768
  • Released: Sep 20, 2024
  • License: apache-2.0

tanliboy/lambda-qwen2.5-14b-dpo-test is a 14.8-billion-parameter language model fine-tuned from Qwen/Qwen2.5-14B-Instruct. It supports a 131,072-token context length and was optimized with Direct Preference Optimization (DPO) on the HuggingFaceH4/ultrafeedback_binarized dataset. The model targets tasks that require nuanced, preference-aligned generation, and it shows improved reward metrics over its base model.

Model Overview

The model is built on Qwen2.5-14B-Instruct and further fine-tuned with Direct Preference Optimization (DPO) on the HuggingFaceH4/ultrafeedback_binarized dataset, aligning its outputs more closely with human preferences.
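
The DPO objective can be sketched in a few lines. This is a minimal, per-pair illustration of the standard DPO loss, not the exact training code used for this model; `beta=0.1` is a common default, not a value reported on the card.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair:
    -log sigmoid(beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))).

    The beta-scaled log-ratios act as implicit rewards measured against a
    frozen reference model (here that would be the base instruct model).
    """
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    loss = -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)
    return loss, chosen_reward, rejected_reward
```

A policy identical to its reference gives a zero margin and a loss of log 2; shifting probability mass toward the chosen responses drives the loss toward zero.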

Key Characteristics

  • Base Model: Qwen/Qwen2.5-14B-Instruct
  • Parameter Count: 14.8 billion
  • Context Length: 131,072 tokens
  • Optimization Method: Direct Preference Optimization (DPO)
  • Training Data: HuggingFaceH4/ultrafeedback_binarized dataset
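
Assuming standard Hugging Face usage for a Qwen2.5-based instruct model, the sketch below shows how one might query it. The helper names and generation settings are illustrative, not taken from the card; the chat formatting follows the usual `apply_chat_template` convention for instruct models.

```python
MODEL_ID = "tanliboy/lambda-qwen2.5-14b-dpo-test"

def build_messages(user_prompt, system_prompt="You are a helpful assistant."):
    """Assemble a chat in the messages format expected by apply_chat_template."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]

def generate(user_prompt, max_new_tokens=256):
    # Imports kept local so build_messages stays usable without transformers.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")
    text = tokenizer.apply_chat_template(
        build_messages(user_prompt), tokenize=False, add_generation_prompt=True
    )
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens, not the prompt.
    new_tokens = out[0][inputs["input_ids"].shape[-1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```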

Performance Highlights

In evaluation, the model reached a rewards accuracy of 0.7400 and a rewards margin of 0.8984, indicating that it reliably distinguishes preferred from rejected responses. Training used a learning rate of 5e-07 and a total batch size of 128, and ran for 1 epoch on 8 GPUs.
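
Assuming the usual definitions of these two metrics as logged by common DPO trainers (accuracy: fraction of pairs where the chosen response earns the higher implicit reward; margin: mean reward gap), they can be reproduced from per-pair rewards as follows:

```python
def reward_metrics(chosen_rewards, rejected_rewards):
    """Aggregate per-pair implicit rewards into the two logged metrics:
    accuracy - fraction of pairs where chosen beats rejected,
    margin   - mean (chosen - rejected) reward gap."""
    pairs = list(zip(chosen_rewards, rejected_rewards))
    accuracy = sum(c > r for c, r in pairs) / len(pairs)
    margin = sum(c - r for c, r in pairs) / len(pairs)
    return accuracy, margin
```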

Intended Use Cases

This model is particularly well-suited for applications where aligning with human feedback and generating high-quality, preference-aligned text is crucial. Its DPO fine-tuning makes it a strong candidate for tasks requiring nuanced response generation and improved conversational quality.