Overview
YeungNLP/firefly-qwen1.5-en-7b-dpo-v0.1 is a 7.7-billion-parameter model built on the Qwen1.5-7B base. Developed by YeungNLP, it was first fine-tuned on English instruction data and then aligned with Direct Preference Optimization (DPO). Both training stages were run on a single V100 GPU using QLoRA.
Key Capabilities & Performance
This model is intended as a helpful and harmless AI assistant, optimized primarily for English interactions. On the Open LLM Leaderboard it outperforms the official Qwen1.5-7B-Chat, Gemma-7B-it, and Zephyr-7B-Beta models on overall average score, reaching 62.36, including 61.21 on MMLU and 54.13 on GSM8K. Because its Qwen1.5 base was also pretrained on Chinese, it can hold Chinese conversations as well, though its Chinese performance has not been formally evaluated.
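Qwen1.5-family chat models typically expect prompts in the ChatML format. The snippet below is a minimal sketch of that formatting, assuming this fine-tune follows the usual Qwen1.5 convention; the special tokens and template shown are an assumption, so check the tokenizer's built-in chat template before relying on them.

```python
# Hedged sketch: assumes the model uses the ChatML-style template common to
# Qwen1.5 chat models (<|im_start|>/<|im_end|> delimiters). Verify against
# the tokenizer's own chat template before use.

def build_chatml_prompt(messages):
    """Render a list of {role, content} dicts into a ChatML-style prompt."""
    parts = []
    for m in messages:
        parts.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>")
    # Leave the prompt open for the assistant's reply.
    parts.append("<|im_start|>assistant\n")
    return "\n".join(parts)

prompt = build_chatml_prompt([
    {"role": "system", "content": "You are a helpful and harmless assistant."},
    {"role": "user", "content": "Summarize DPO in one sentence."},
])
```

In practice you would pass such a prompt (or the message list itself, via `tokenizer.apply_chat_template`) to the model for generation.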
Training Details
The model went through Supervised Fine-Tuning (SFT) followed by DPO, each stage on a single V100 GPU with QLoRA. SFT used 1 epoch, a learning rate of 2e-4, and a max sequence length of 2048. DPO likewise used 1 epoch and a learning rate of 2e-4, with a max sequence length of 1600 and a max prompt length of 500. The reported DPO training metrics (rewards, reward accuracies, and loss) indicate effective preference alignment.
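To make the DPO stage concrete, the sketch below computes the standard DPO objective from per-sequence log-probabilities, along with the implicit reward that training metrics like "rewards" and "accuracies" are derived from. The numeric values and the beta=0.1 default are illustrative assumptions, not values stated for this model.

```python
import math

# Hedged sketch of the DPO objective optimized in the preference stage.
# The log-probabilities below are toy numbers, not model outputs, and
# beta=0.1 is a common default rather than a documented setting here.

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """-log sigmoid(beta * (policy log-ratio margin - reference margin))."""
    policy_margin = policy_chosen_logp - policy_rejected_logp
    ref_margin = ref_chosen_logp - ref_rejected_logp
    logits = beta * (policy_margin - ref_margin)
    return -math.log(1.0 / (1.0 + math.exp(-logits)))

def implicit_reward(policy_logp, ref_logp, beta=0.1):
    """The beta-scaled log-ratio reported as 'rewards' during DPO training."""
    return beta * (policy_logp - ref_logp)

# Policy prefers the chosen response more strongly than the reference does,
# so the loss falls below -log(0.5) ~= 0.693.
loss = dpo_loss(-10.0, -14.0, -11.0, -13.0)
```

Reward accuracy during training is simply the fraction of preference pairs where the chosen response's implicit reward exceeds the rejected one's.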