STAIR-Qwen2-7B-DPO-3 is a 7.6 billion parameter language model developed by thu-ml, fine-tuned from Qwen/Qwen2-7B-Instruct. It utilizes a self-improvement framework called STAIR, undergoing three iterations of DPO training using self-generated prompt-pair data. This model is designed to produce responses with explicit reasoning steps, making it suitable for tasks requiring transparent, step-by-step problem-solving and safety-conscious final answers.
Model Overview
STAIR-Qwen2-7B-DPO-3 is a 7.6 billion parameter language model developed by thu-ml, built upon the Qwen/Qwen2-7B-Instruct architecture. This model has undergone a unique self-improvement process using the STAIR framework (STAIR paper), specifically through three iterations of Direct Preference Optimization (DPO) training. The training data for DPO was generated by the model itself, using prompts from various sources.
Key Capabilities
- Reasoning Step Generation: The model is designed to output responses that include explicit reasoning steps before providing a final answer. This structure enhances transparency and interpretability.
- Safety-Conscious Responses: By separating reasoning from the final output, the model can be evaluated for correctness and safety based on the final answer alone, which can be extracted by splitting the response on its special tokens.
- Self-Improvement: Leverages the STAIR framework for iterative refinement, indicating a focus on enhancing performance through automated feedback loops.
Usage Considerations
This model is particularly well-suited for applications where not just the answer, but also the process of arriving at it, matters. Its structured output format, delimited by <|Reasoning_step|> and <|Output|> tokens, makes complex responses straightforward to parse and evaluate. Developers should account for this token structure when extracting final answers.
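A minimal parsing sketch for the token structure described above. The exact layout of a real STAIR response (number of reasoning segments, closing markers) is an assumption here; only the <|Reasoning_step|> and <|Output|> token names come from this card, and the sample string is illustrative, not actual model output.

```python
def extract_final_answer(response: str, output_token: str = "<|Output|>") -> str:
    """Return the text after the last <|Output|> marker in a STAIR-style
    response, falling back to the raw text if the marker is absent."""
    if output_token not in response:
        return response.strip()  # no marker: return the response as-is
    return response.split(output_token)[-1].strip()

# Illustrative response in the documented token format (hypothetical content).
sample = (
    "<|Reasoning_step|>Step 1: Compute 12 * 3 = 36."
    "<|Reasoning_step|>Step 2: Check the result."
    "<|Output|>The answer is 36."
)
print(extract_final_answer(sample))  # The answer is 36.
```

Splitting on the last occurrence of the output token keeps the extraction robust even if the reasoning trace itself mentions the marker.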