Kyleyee/CPO_hh-seed4
Kyleyee/CPO_hh-seed4 is a 1.5-billion-parameter language model fine-tuned by Kyleyee from Kyleyee/Qwen2.5-1.5B-sft-hh-3e. It was trained with Contrastive Preference Optimization (CPO) on a helpfulness preference dataset, tuning it toward helpful, aligned responses. The model targets tasks that require nuanced, preference-based text generation and supports a 32,768-token context length.
Model Overview
Kyleyee/CPO_hh-seed4 is a 1.5-billion-parameter language model developed by Kyleyee on top of the Kyleyee/Qwen2.5-1.5B-sft-hh-3e base model. It was fine-tuned with the Contrastive Preference Optimization (CPO) method, described in the paper "Contrastive Preference Optimization: Pushing the Boundaries of LLM Performance in Machine Translation". Training used the Kyleyee/train_data_Helpful_drdpo_preference dataset and the TRL framework.
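A minimal quick-start sketch, assuming the weights and tokenizer are published on the Hugging Face Hub under the model id above and that the tokenizer inherits Qwen2.5's chat template; the prompt and generation parameters below are illustrative, not taken from the model card:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Kyleyee/CPO_hh-seed4"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# device_map="auto" requires the accelerate package; drop it for CPU-only use.
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [{"role": "user", "content": "How do I safely update my system's packages?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```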
Key Capabilities
- Preference-aligned Generation: Optimized for generating responses that align with human preferences, particularly for helpfulness, due to its CPO training on a preference dataset.
- Foundation Model: Based on the Qwen2.5 architecture, providing a robust base for language understanding and generation tasks.
- Extended Context Window: Supports a context length of 32,768 tokens, allowing it to process and generate longer, more coherent texts (see the check after this list).
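One way to verify the advertised context window, assuming the configuration published on the Hub exposes Qwen2.5's standard max_position_embeddings field:

```python
from transformers import AutoConfig

# Fetch only the model configuration, not the weights.
config = AutoConfig.from_pretrained("Kyleyee/CPO_hh-seed4")
print(config.max_position_embeddings)  # expected: 32768
```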
Training Details
The model's distinguishing characteristic is its training with CPO, a method that applies contrastive learning to preference data. By teaching the model to separate preferred from non-preferred outputs, and, unlike DPO, dispensing with a frozen reference model while adding a log-likelihood term on the preferred responses, CPO aims at more aligned generations at lower memory cost. Training was conducted with the TRL (Transformer Reinforcement Learning) framework alongside the Transformers, PyTorch, Datasets, and Tokenizers libraries.
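An illustrative sketch of this setup using TRL's CPOTrainer; only the base model, dataset, and trainer names come from the card, and all hyperparameters below are placeholders:

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import CPOConfig, CPOTrainer

base = "Kyleyee/Qwen2.5-1.5B-sft-hh-3e"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# TRL expects a preference dataset with "prompt", "chosen", and
# "rejected" columns; the card's dataset is assumed to follow this format.
dataset = load_dataset("Kyleyee/train_data_Helpful_drdpo_preference", split="train")

training_args = CPOConfig(
    output_dir="CPO_hh-seed4",
    per_device_train_batch_size=2,  # placeholder value
    num_train_epochs=1,             # placeholder value
    beta=0.1,                       # CPO temperature; placeholder value
)

trainer = CPOTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    processing_class=tokenizer,  # named tokenizer= in older TRL releases
)
trainer.train()
```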
Use Cases
This model suits applications where generating helpful, preference-aligned, and contextually rich text is crucial, for example assistant-style dialogue in the spirit of its helpfulness training data. Its CPO training makes it a strong candidate for tasks that demand a nuanced reading of user preferences and responses that reflect those preferences.