jackf857/llama-3-8b-base-cpo-ultrafeedback-4xH200-batch-128-rerun
jackf857/llama-3-8b-base-cpo-ultrafeedback-4xH200-batch-128-rerun is an 8 billion parameter Llama 3 base model, fine-tuned with CPO (Contrastive Preference Optimization) on the HuggingFaceH4/ultrafeedback_binarized dataset. It is a re-run of an earlier training run and starts from the SFT checkpoint W-61/llama-3-8b-base-sft-ultrachat-8xh200. The model targets preference alignment; on its evaluation set it reports a rewards accuracy of 0.5160 and a rewards margin of -0.0586. It is intended for tasks that call for responses aligned with human preferences.
Model Overview
This model, jackf857/llama-3-8b-base-cpo-ultrafeedback-4xH200-batch-128-rerun, is an 8 billion parameter language model based on the Llama 3 architecture. It was fine-tuned with Contrastive Preference Optimization (CPO) on the HuggingFaceH4/ultrafeedback_binarized dataset, which pairs a chosen and a rejected response for each prompt. CPO trains the model to assign higher likelihood to the chosen response than to the rejected one, with the aim of aligning its outputs more closely with human preferences.
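To make the training objective concrete, the following is a minimal sketch of a per-pair CPO loss in its common sigmoid form: a preference term on the log-probability gap plus a negative log-likelihood term on the chosen response. The function names and the `beta` and `alpha` defaults are illustrative assumptions; this card does not state the hyperparameters or the exact loss variant used for this run.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def cpo_loss(logp_chosen: float, logp_rejected: float,
             beta: float = 0.1, alpha: float = 1.0) -> float:
    """Per-pair CPO loss (sketch).

    logp_chosen / logp_rejected: sequence log-probabilities under the
    policy being trained. CPO uses no separate reference model.
    beta, alpha: assumed values, not taken from this model card.
    """
    # Preference term: push the chosen response above the rejected one.
    preference = -log_sigmoid(beta * (logp_chosen - logp_rejected))
    # Behavior-cloning term: keep the chosen response likely.
    nll = -logp_chosen
    return preference + alpha * nll


# A pair the policy already ranks correctly gets a lower loss
# than the same pair ranked the wrong way round.
print(cpo_loss(-5.0, -20.0), cpo_loss(-20.0, -5.0))
```

With `alpha=0.0` and equal log-probabilities, the loss reduces to `log(2)`, the value of the preference term when the policy has no preference between the two responses.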
Key Characteristics
- Base Model: Fine-tuned from W-61/llama-3-8b-base-sft-ultrachat-8xh200.
- Training Method: Utilizes CPO for preference alignment.
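- Evaluation Metrics: Achieved a rewards accuracy of 0.5160 and a rewards margin of -0.0586 on its evaluation set. Rewards accuracy is the fraction of pairs in which the chosen response receives a higher reward than the rejected one, so 0.5160 is close to the 0.5 chance level; the negative margin means that, on average, rejected responses scored slightly higher than chosen ones.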
- Context Length: Supports an 8192 token context window.
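The two evaluation metrics above can be illustrated with a short sketch. It assumes each evaluation pair yields one scalar reward for the chosen response and one for the rejected response; the `reward_metrics` helper and the toy numbers are made up for illustration and are not data from this model's evaluation.

```python
def reward_metrics(chosen_rewards, rejected_rewards):
    """Return (rewards accuracy, rewards margin) over paired rewards."""
    pairs = list(zip(chosen_rewards, rejected_rewards))
    if not pairs:
        raise ValueError("need at least one (chosen, rejected) pair")
    # Accuracy: share of pairs where the chosen response wins.
    accuracy = sum(c > r for c, r in pairs) / len(pairs)
    # Margin: mean reward gap, chosen minus rejected.
    margin = sum(c - r for c, r in pairs) / len(pairs)
    return accuracy, margin


# Toy rewards (hypothetical): half the pairs are ranked correctly,
# yet the average gap is slightly negative.
chosen = [-1.2, -0.8, -2.0, -1.5]
rejected = [-1.0, -1.1, -1.7, -1.6]
accuracy, margin = reward_metrics(chosen, rejected)
print(f"accuracy={accuracy:.4f} margin={margin:.4f}")
```

The toy data shows how an accuracy near 0.5 and a small negative margin can occur together, which is the pattern reported for this model.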
Intended Use Cases
This model is intended for applications where responses should align with human preferences. Its CPO fine-tuning targets use cases such as:
- Dialogue systems: Generating more helpful and harmless conversational turns.
- Content generation: Producing text that adheres to specific quality or style guidelines based on preference data.
- Assistant models: Enhancing the quality and relevance of AI assistant responses.
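For any of these use cases, a minimal loading and generation sketch might look like the following. It assumes the checkpoint loads through the Hugging Face `transformers` Auto classes, that `torch` and `accelerate` are installed (the latter for `device_map="auto"`), and that a plain-text prompt is acceptable; this card does not document a prompt or chat format. The `max_prompt_tokens` and `generate` helpers are hypothetical names introduced here.

```python
MODEL_ID = "jackf857/llama-3-8b-base-cpo-ultrafeedback-4xH200-batch-128-rerun"
CONTEXT_LENGTH = 8192  # context window stated in this card


def max_prompt_tokens(max_new_tokens: int,
                      context_length: int = CONTEXT_LENGTH) -> int:
    """Prompt-token budget that leaves room for the generated tokens."""
    if not 0 < max_new_tokens < context_length:
        raise ValueError("max_new_tokens must be in (0, context_length)")
    return context_length - max_new_tokens


def generate(prompt: str, max_new_tokens: int = 256) -> str:
    """Load the model and return a completion (downloads the weights)."""
    # Heavy imports are deferred so the helper above works without them.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        torch_dtype=torch.bfloat16,  # assumed dtype, not stated in the card
        device_map="auto",
    )
    # Truncate the prompt so prompt + completion fit the context window.
    inputs = tokenizer(
        prompt,
        return_tensors="pt",
        truncation=True,
        max_length=max_prompt_tokens(max_new_tokens),
    ).to(model.device)
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens, not the echoed prompt.
    new_tokens = output_ids[0, inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)


if __name__ == "__main__":
    print(generate("Explain what preference optimization does."))
```

Keeping the token budget in a separate helper makes the 8192-token limit explicit and lets it be checked without loading the 8 billion parameter model.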