wxzhang/dpo-selective-buffer-spo-shift
Text generation · Model size: 7B · Quantization: FP8 · Context length: 4k · Architecture: Transformer

wxzhang/dpo-selective-buffer-spo-shift is a 7-billion-parameter language model developed by wxzhang and trained with DPO (Direct Preference Optimization). The model card reports reward metrics for chosen and rejected responses, indicating that training optimized the model for preference alignment. Because the DPO procedure aligns outputs with human preferences, the model is intended for tasks that benefit from nuanced response generation guided by preference signals.
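The chosen/rejected reward metrics mentioned above come from the standard DPO objective, which scores each response by how much the policy's log-probability moves away from a frozen reference model. A minimal sketch of that loss for a single preference pair (plain Python, scalar log-probabilities; variable names are illustrative and not from this model's training code):

```python
import math

def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1):
    """DPO loss for one (chosen, rejected) preference pair.

    Each argument is the summed log-probability of a full response
    under the trainable policy or the frozen reference model.
    Returns (loss, chosen_reward, rejected_reward); the two rewards
    are the implicit reward terms a model card typically reports.
    """
    # Implicit rewards: beta-scaled log-ratio of policy to reference.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)) computed stably as softplus(-margin).
    loss = math.log1p(math.exp(-margin))
    return loss, chosen_reward, rejected_reward
```

When the policy and reference assign identical log-probabilities, the margin is zero and the loss equals log 2; as the policy raises the chosen response relative to the rejected one, the loss decreases toward zero.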
