STAIR-Qwen2-7B-DPO-3 is a 7.6 billion parameter language model developed by thu-ml, fine-tuned from Qwen/Qwen2-7B-Instruct. It utilizes a self-improvement framework called STAIR, undergoing three iterations of DPO training using self-generated prompt-pair data. This model is designed to produce responses with explicit reasoning steps, making it suitable for tasks requiring transparent, step-by-step problem-solving and safety-conscious final answers.
Model Overview
STAIR-Qwen2-7B-DPO-3 is a 7.6 billion parameter language model developed by thu-ml, built upon the Qwen/Qwen2-7B-Instruct architecture. This model has undergone a unique self-improvement process using the STAIR framework (STAIR paper), specifically through three iterations of Direct Preference Optimization (DPO) training. The training data for DPO was generated by the model itself, using prompts from various sources.
Key Capabilities
- Reasoning Step Generation: The model is designed to output responses that include explicit reasoning steps before providing a final answer. This structure enhances transparency and interpretability.
- Safety-Conscious Responses: By separating reasoning from the final output, the model can be evaluated for correctness and safety based on the final answer alone, which can be extracted by splitting the response on its special tokens.
- Self-Improvement: Leverages the STAIR framework for iterative refinement, indicating a focus on enhancing performance through automated feedback loops.
Usage Considerations
This model is particularly well-suited for applications where not just the answer, but also the process of arriving at it, matters. Its structured output format, delimited by <|Reasoning_step|> and <|Output|> tokens, makes complex responses straightforward to parse and evaluate. Developers should account for this token structure when extracting final answers.
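A minimal parsing sketch for the token structure described above. The exact layout of a real STAIR response (number of reasoning segments, closing markers) is an assumption here; only the <|Reasoning_step|> and <|Output|> token names come from this card, and the sample string is illustrative, not actual model output.

```python
def extract_final_answer(response: str, output_token: str = "<|Output|>") -> str:
    """Return the text after the last <|Output|> marker in a STAIR-style
    response, falling back to the raw text if the marker is absent."""
    if output_token not in response:
        return response.strip()  # no marker: return the response as-is
    return response.split(output_token)[-1].strip()

# Illustrative response in the documented token format (hypothetical content).
sample = (
    "<|Reasoning_step|>Step 1: Compute 12 * 3 = 36."
    "<|Reasoning_step|>Step 2: Check the result."
    "<|Output|>The answer is 36."
)
print(extract_final_answer(sample))  # The answer is 36.
```

Splitting on the last occurrence of the output token keeps the extraction robust even if the reasoning trace itself mentions the marker.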