Ejafa/qwen2-0.5b-instruct-simpo-lr-5e-07-gamma-1.5
Ejafa/qwen2-0.5b-instruct-simpo-lr-5e-07-gamma-1.5 is a 0.5 billion parameter instruction-tuned Qwen2-0.5B-Instruct model, fine-tuned by Ejafa Bassam and Yaroslav Ponomarenko as part of the Reinforcement Learning - 24 project at Peking University. This model focuses on the SIMPO (Simple Preference Optimization) method and was trained on the princeton-nlp/llama3-ultrafeedback dataset. It is designed for tasks requiring preference optimization, demonstrating specific reward metrics on its evaluation set.
Loading preview...
Model Overview
This model, Ejafa/qwen2-0.5b-instruct-simpo-lr-5e-07-gamma-1.5, is a fine-tuned version of the Qwen2-0.5B-Instruct architecture. Developed by Ejafa Bassam and Yaroslav Ponomarenko at Peking University, it was trained as part of the Reinforcement Learning - 24 project with a specific focus on the SIMPO (Simple Preference Optimization) method.
Key Characteristics
- Base Model: Qwen/Qwen2-0.5B-Instruct.
- Training Dataset: Fine-tuned on the
princeton-nlp/llama3-ultrafeedbackdataset. - Optimization Method: Utilizes the SIMPO approach for preference alignment.
- Training Hyperparameters: Employed a learning rate of 5e-07, a total training batch size of 128, and a cosine learning rate scheduler over 1 epoch.
Evaluation Performance
During evaluation, the model achieved a loss of 1.6594. Key reward metrics include:
- Rewards/accuracies: 0.5282
- Rewards/margins: 0.1325
- Rewards/chosen: -3.3473
- Rewards/rejected: -3.4798
Intended Uses
This model is suitable for research and applications involving preference-based learning and instruction following, particularly where a compact model size (0.5B parameters) is beneficial. Its training on a feedback dataset suggests potential for tasks requiring alignment with human preferences.