Ejafa/qwen2-0.5b-instruct-simpo-lr-5e-07-gamma-1.5

Hugging Face
TEXT GENERATIONConcurrency Cost:1Model Size:0.5BQuant:BF16Ctx Length:32kPublished:Jun 21, 2024License:apache-2.0Architecture:Transformer Open Weights Warm

Ejafa/qwen2-0.5b-instruct-simpo-lr-5e-07-gamma-1.5 is a 0.5 billion parameter instruction-tuned Qwen2-0.5B-Instruct model, fine-tuned by Ejafa Bassam and Yaroslav Ponomarenko as part of the Reinforcement Learning - 24 project at Peking University. This model focuses on the SIMPO (Simple Preference Optimization) method and was trained on the princeton-nlp/llama3-ultrafeedback dataset. It is designed for tasks requiring preference optimization, demonstrating specific reward metrics on its evaluation set.

Loading preview...

Model Overview

This model, Ejafa/qwen2-0.5b-instruct-simpo-lr-5e-07-gamma-1.5, is a fine-tuned version of the Qwen2-0.5B-Instruct architecture. Developed by Ejafa Bassam and Yaroslav Ponomarenko at Peking University, it was trained as part of the Reinforcement Learning - 24 project with a specific focus on the SIMPO (Simple Preference Optimization) method.

Key Characteristics

  • Base Model: Qwen/Qwen2-0.5B-Instruct.
  • Training Dataset: Fine-tuned on the princeton-nlp/llama3-ultrafeedback dataset.
  • Optimization Method: Utilizes the SIMPO approach for preference alignment.
  • Training Hyperparameters: Employed a learning rate of 5e-07, a total training batch size of 128, and a cosine learning rate scheduler over 1 epoch.

Evaluation Performance

During evaluation, the model achieved a loss of 1.6594. Key reward metrics include:

  • Rewards/accuracies: 0.5282
  • Rewards/margins: 0.1325
  • Rewards/chosen: -3.3473
  • Rewards/rejected: -3.4798

Intended Uses

This model is suitable for research and applications involving preference-based learning and instruction following, particularly where a compact model size (0.5B parameters) is beneficial. Its training on a feedback dataset suggests potential for tasks requiring alignment with human preferences.