MoxoffSrL/Moxoff-Phi3Mini-PPO

Text generation · 4B parameters · BF16 · 4k context · Published: Jun 25, 2024 · License: MIT · Architecture: Transformer (open weights)

MoxoffSrL/Moxoff-Phi3Mini-PPO is a 4 billion parameter causal language model developed by MoxoffSrL, fine-tuned from Microsoft's Phi-3-mini-128k-instruct. The model has been aligned using Proximal Policy Optimization (PPO) on the ultrafeedback-binarized-preferences-cleaned dataset. It is designed for general language tasks and demonstrates competitive performance on benchmarks such as HellaSwag, ARC Challenge, and MMLU.


Model Overview

Moxoff-Phi3Mini-PPO is a 4 billion parameter language model developed by MoxoffSrL. It is an enhanced version of the Microsoft Phi-3-mini-128k-instruct model, distinguished by its alignment through Proximal Policy Optimization (PPO).

Key Characteristics

  • PPO Alignment: The model has undergone PPO alignment using the ultrafeedback-binarized-preferences-cleaned dataset, which helps in refining its responses based on human preferences.
  • Performance: Evaluated against the Open LLM Leaderboard test sets, it achieves an average score of 0.5833, with specific scores of 0.7044 on HellaSwag acc_norm, 0.4701 on ARC Challenge acc_norm, and 0.5814 on MMLU 5-shot acc.
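To make the PPO alignment step concrete, below is a minimal sketch of PPO's clipped surrogate objective, the loss that preference-based PPO fine-tuning minimizes per token. This is illustrative only: MoxoffSrL's actual training pipeline is not public, and the function name here is hypothetical.

```python
import math

def ppo_clip_loss(logprob_new, logprob_old, advantage, clip_eps=0.2):
    """Per-token PPO clipped surrogate loss (to be minimized).

    logprob_new / logprob_old: log-probabilities of the sampled token
    under the current and rollout policies; advantage: estimated
    advantage of that token (e.g. from a reward model plus a critic).
    """
    # Importance ratio pi_new(token) / pi_old(token).
    ratio = math.exp(logprob_new - logprob_old)
    unclipped = ratio * advantage
    # Clip the ratio to [1 - eps, 1 + eps] so one update cannot move
    # the policy too far from the rollout policy.
    clipped = max(min(ratio, 1.0 + clip_eps), 1.0 - clip_eps) * advantage
    # Pessimistic bound: take the smaller objective, negate for a loss.
    return -min(unclipped, clipped)
```

With a positive advantage, the clipping caps the gain once the new policy's probability exceeds the old one by more than `clip_eps`, which is what keeps PPO updates conservative relative to plain policy gradient.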

Limitations

  • Safety: The model has not been explicitly aligned for safety via RLHF or deployment-time response filtering, so it may produce problematic outputs, particularly when prompted to do so.
  • Base Model Training Data: The exact size and composition of the training corpus for the base Phi-3 model are not fully known, though it likely includes a mix of web data and technical sources.