tsavage68/chat_1000STEPS_1e6_05beta_DPO
Text generation · Concurrency cost: 1 · Model size: 7B · Quantization: FP8 · Context length: 4K · Published: Feb 16, 2024 · Architecture: Transformer
tsavage68/chat_1000STEPS_1e6_05beta_DPO is a 7-billion-parameter language model, fine-tuned from Meta's Llama-2-7b-chat-hf base model using Direct Preference Optimization (DPO). The model reports a reward accuracy of 53.19% on its evaluation set, i.e., the fraction of evaluation pairs for which it assigns a higher implicit reward to the preferred response than to the rejected one. Built on the Llama 2 architecture, it is intended for chat applications where preference alignment matters.
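As a rough illustration, the model name suggests the DPO run used 1000 training steps, a learning rate of 1e-6, and beta = 0.5. The sketch below shows how such a run might look with Hugging Face's TRL library; the dataset name is a placeholder, and the exact DPOTrainer interface varies across TRL versions, so treat this as an assumed setup rather than the author's actual training script.

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "meta-llama/Llama-2-7b-chat-hf"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Hypothetical preference dataset with "prompt", "chosen", "rejected" columns.
dataset = load_dataset("your_org/your_preference_dataset", split="train")

config = DPOConfig(
    output_dir="chat_1000STEPS_1e6_05beta_DPO",
    max_steps=1000,       # "1000STEPS" in the model name
    learning_rate=1e-6,   # "1e6" in the model name
    beta=0.5,             # "05beta": strength of the KL penalty toward the reference model
    per_device_train_batch_size=1,
)

# With ref_model omitted, TRL creates a frozen copy of the base model
# to serve as the DPO reference policy.
trainer = DPOTrainer(
    model=model,
    args=config,
    train_dataset=dataset,
    tokenizer=tokenizer,
)
trainer.train()

A higher beta such as 0.5 keeps the fine-tuned policy closer to the reference chat model, trading off preference fit against drift from the base model's behavior.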