Jaew00Lee/Qwen3-4B-PRInTS
Jaew00Lee/Qwen3-4B-PRInTS is a 4-billion-parameter, Qwen3-based generative process reward model developed by Jaewoo Lee, Archiki Prasad, Justin Chih-Yao Chen, Zaid Khan, Elias Stengel-Eskin, and Mohit Bansal. It is fine-tuned for long-horizon information-seeking tasks, where it evaluates the steps of an agent's trajectory and recursively summarizes the accumulated context. Its main use is fine-grained guidance for information-seeking agents: it scores candidate next steps and maintains a compact trajectory summary so that its input stays within the 40,960-token context window.
Overview of PRInTS Qwen3-4B
PRInTS (Process Reward via Information gain scoring and Trajectory Summarization) Qwen3-4B is a 4-billion-parameter generative process reward model developed by Jaewoo Lee et al. It is fine-tuned from the Qwen3-4B base model, has a context length of 40,960 tokens, and is designed to address context accumulation in long-horizon information-seeking tasks.
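As a rough starting point, the model can presumably be loaded like any other Qwen3-based causal language model through the Hugging Face `transformers` library. The sketch below assumes that the repository name above is a valid Hub identifier and that `transformers` and `torch` are installed; `load_prints` and `generation_budget` are hypothetical helper names, and the prompt format the model expects is not described here, so it is left out.

```python
MODEL_ID = "Jaew00Lee/Qwen3-4B-PRInTS"
CONTEXT_LIMIT = 40960  # context window stated in the overview above


def load_prints(model_id: str = MODEL_ID):
    """Load the tokenizer and model (assumes `transformers` and `torch` are installed)."""
    # Imported lazily so the helper below can be used without these dependencies.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
    return tokenizer, model


def generation_budget(prompt_tokens: int, limit: int = CONTEXT_LIMIT) -> int:
    """Tokens left for generation after the prompt; never negative."""
    return max(0, limit - prompt_tokens)
```

Checking `generation_budget` before each call is one simple way to notice when a prompt is about to exceed the context window.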
Key Capabilities
- Generative Process Reward Model (PRM): Jointly trained for two abilities, step scoring and trajectory summarization, which together provide fine-grained guidance.
- Scoring Mechanism: Evaluates multiple candidate next trajectory steps for an agent, providing dense scores based on reasoning across various step quality dimensions (e.g., interpretation of tool outputs, tool call informativeness).
- Trajectory Summarization: Recursively updates a compact summary of the information-seeking trajectory. This keeps the input length bounded while preserving the information needed for later scoring.
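The recursive summary update described above can be sketched in plain Python. This is an illustration of the idea only, not the model's actual interface: `SummarizeFn`, `build_scoring_input`, `roll_summary`, and `truncating_summarizer` are hypothetical names, the prompt layout is invented, and the truncating summarizer is a stand-in for a call to the model.

```python
from typing import Callable, List

# Hypothetical signature: (previous_summary, latest_step) -> updated summary.
SummarizeFn = Callable[[str, str], str]


def build_scoring_input(query: str, summary: str, candidate: str) -> str:
    """Assemble what a scorer would read: the summary, not the raw history."""
    return f"Query: {query}\nSummary: {summary}\nCandidate step: {candidate}"


def roll_summary(steps: List[str], summarize_fn: SummarizeFn) -> List[str]:
    """Fold steps into the summary one at a time; return every intermediate summary."""
    summary = ""
    history = []
    for step in steps:
        summary = summarize_fn(summary, step)
        history.append(summary)
    return history


def truncating_summarizer(max_chars: int) -> SummarizeFn:
    """Stand-in summarizer: keeps only the last `max_chars` characters."""
    def summarize(summary: str, step: str) -> str:
        merged = (summary + " " + step).strip()
        return merged[-max_chars:]
    return summarize
```

Because each update sees only the previous summary and the newest step, the scoring input grows with the summary size rather than with the number of steps taken.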
Use Cases
- Agent Guidance: Provides fine-grained, step-level guidance for information-seeking agents at test time.
- Information-Seeking Tasks: Optimized for scenarios requiring long-horizon information retrieval and processing.
- Trajectory Evaluation: Estimates step-level information-gain scores across multiple agent rollouts, enhancing decision-making for complex tasks.
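Test-time guidance of this kind amounts to ranking candidate next steps by score and following the best one. The sketch below assumes a scorer with the hypothetical signature `(query, summary, candidate) -> float`; `rank_candidates`, `select_step`, and `overlap_scorer` are illustrative names, and the toy scorer only mimics the notion of information gain by rewarding query words the summary does not yet contain. In practice the score would come from the model.

```python
from typing import Callable, List, Tuple

# Hypothetical scorer signature: (query, summary, candidate) -> score.
ScoreFn = Callable[[str, str, str], float]


def rank_candidates(query: str, summary: str, candidates: List[str],
                    score_fn: ScoreFn) -> List[Tuple[float, str]]:
    """Score every candidate and return (score, candidate) pairs, best first."""
    scored = [(score_fn(query, summary, c), c) for c in candidates]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


def select_step(query: str, summary: str, candidates: List[str],
                score_fn: ScoreFn) -> str:
    """Return the highest-scoring candidate step."""
    if not candidates:
        raise ValueError("at least one candidate step is required")
    return rank_candidates(query, summary, candidates, score_fn)[0][1]


def overlap_scorer(query: str, summary: str, candidate: str) -> float:
    """Toy scorer: fraction of query words the candidate adds beyond the summary."""
    query_words = set(query.lower().split())
    seen = set(summary.lower().split())
    new_words = set(candidate.lower().split()) - seen
    return len(query_words & new_words) / len(query_words) if query_words else 0.0
```

For example, with the query `"capital of france population"` and the summary `"capital of france is paris"`, the candidate `"search paris population"` outranks `"search capital of france"`, since only the former touches a query term that the summary has not covered.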
The model is released under the MIT license. Its development is described in the paper "PRInTS: Reward Modeling for Long-Horizon Information Seeking".