tanliboy/lambda-qwen2.5-14b-dpo-test is a 14.8-billion-parameter language model fine-tuned from Qwen/Qwen2.5-14B-Instruct. It supports a 131,072-token context length and was optimized with Direct Preference Optimization (DPO) on the HuggingFaceH4/ultrafeedback_binarized dataset. The model is intended for tasks that require nuanced, preference-aligned understanding and generation, and it demonstrates improved reward metrics over its base model.
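The DPO objective mentioned above can be sketched as follows. This is a minimal illustration of the standard DPO loss on per-sequence log-probabilities, not the actual training code used for this model; the function name and the `beta` temperature value are assumptions for the example.

```python
import math

def dpo_loss(policy_chosen_lp: float, policy_rejected_lp: float,
             ref_chosen_lp: float, ref_rejected_lp: float,
             beta: float = 0.1) -> float:
    """Direct Preference Optimization loss for one preference pair.

    Inputs are total log-probabilities of the chosen and rejected
    responses under the policy being trained and a frozen reference
    model (here, that would be the base instruct model).
    """
    # Implicit rewards: how much the policy has shifted probability
    # mass relative to the reference model, scaled by beta.
    chosen_reward = beta * (policy_chosen_lp - ref_chosen_lp)
    rejected_reward = beta * (policy_rejected_lp - ref_rejected_lp)
    # -log(sigmoid(margin)): small when the policy prefers the
    # chosen response by a wide margin, large when it prefers the
    # rejected one.
    margin = chosen_reward - rejected_reward
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# A policy that upweights the chosen response incurs a lower loss
# than one that upweights the rejected response.
good = dpo_loss(-10.0, -12.0, -11.0, -11.0)
bad = dpo_loss(-12.0, -10.0, -11.0, -11.0)
```

Minimizing this loss over a binarized preference dataset such as ultrafeedback_binarized pushes the policy toward the human-preferred response in each pair while the reference model anchors it near the base model's distribution.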