reasonrag/Qwen2.5-7B-Instruct-ReasonRAG

Text generation · Concurrency cost: 1 · Model size: 7.6B · Quantization: FP8 · Context length: 32k · Published: May 19, 2025 · License: other · Architecture: Transformer

reasonrag/Qwen2.5-7B-Instruct-ReasonRAG is a 7.6-billion-parameter instruction-tuned causal language model, fine-tuned from Qwen/Qwen2.5-7B-Instruct. It is optimized for reasoning tasks and reaches a rewards/accuracies score of 0.6204 on its evaluation set, making it suitable for applications that require enhanced logical coherence and problem-solving capabilities.


Overview

reasonrag/Qwen2.5-7B-Instruct-ReasonRAG is an instruction-tuned language model based on the Qwen2.5-7B-Instruct architecture. It has been fine-tuned using the dpo_mcts_rag_v8 dataset, focusing on improving its reasoning and response generation capabilities through Direct Preference Optimization (DPO).

Key Capabilities

  • Enhanced Reasoning: Achieves a rewards/accuracies score of 0.6204 on its evaluation set, meaning the model assigns the higher implicit reward to the preferred response in roughly 62% of evaluation preference pairs.
  • Instruction Following: Benefits from its base Qwen2.5-7B-Instruct model's strong instruction-following abilities, further refined by DPO.
  • Optimized for Preference: The DPO training process aims to align the model's outputs more closely with human preferences, leading to higher-quality and more relevant responses.
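The preference alignment described above can be made concrete with a minimal sketch of the DPO objective for a single preference pair. This is an illustrative reimplementation, not the training code used for this model; the log-probability inputs and the beta value are hypothetical placeholders. The "rewards" it returns correspond to the rewards/chosen and rewards/rejected metrics reported in the training details: beta times the policy-vs-reference log-probability gap for each response.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) preference pair.

    The implicit reward of a response is beta * (policy log-prob minus
    reference log-prob); the loss pushes the chosen reward above the
    rejected reward.
    """
    reward_chosen = beta * (policy_chosen_logp - ref_chosen_logp)
    reward_rejected = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(margin)): small when the chosen response wins by a wide margin
    loss = -math.log(1.0 / (1.0 + math.exp(-margin)))
    return loss, reward_chosen, reward_rejected

# Example with made-up log-probabilities: the policy prefers the chosen
# response more strongly than the reference does, so the margin is positive.
loss, rc, rr = dpo_loss(-12.0, -20.0, -14.0, -18.0)
```

A positive reward margin (as with this model's 1.0146 chosen vs. -0.5767 rejected) means the fine-tuned policy separates preferred from dispreferred responses more sharply than the frozen reference model.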

Training Details

The model was trained with a learning rate of 1e-06, a per-device batch size of 1 with gradient accumulation over 12 steps (effective batch size 12), and a cosine learning rate scheduler with a 0.2 warmup ratio, for 1 epoch. Training ended with a final loss of 0.8564, a rewards/chosen score of 1.0146, and a rewards/rejected score of -0.5767; the positive margin of roughly 1.59 between chosen and rejected rewards indicates that the model learned to differentiate preferred from rejected responses.
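The schedule described above (linear warmup for the first 20% of steps, then cosine decay from the 1e-06 peak) can be sketched as follows. This is a generic illustration of the warmup-plus-cosine pattern, not the exact scheduler implementation used in training; the step counts are hypothetical.

```python
import math

def lr_at_step(step, total_steps, peak_lr=1e-6, warmup_ratio=0.2):
    """Learning rate at a given optimizer step under linear warmup
    followed by cosine decay to ~0."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear warmup from ~0 up to the peak learning rate
        return peak_lr * (step + 1) / warmup_steps
    # Cosine decay over the remaining steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))

# With 100 total steps and warmup_ratio=0.2, the peak 1e-06 is reached
# at the end of step 19, then the rate decays smoothly toward zero.
```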

When to Use This Model

This model is particularly well-suited for use cases where response quality, logical coherence, and alignment with human preferences are critical. It can be applied in scenarios requiring robust instruction following and improved reasoning over its base model.