THU-KEG/DeepDive-4B-SFT

Hugging Face
TEXT GENERATIONConcurrency Cost:1Model Size:4BQuant:BF16Ctx Length:32kPublished:Mar 12, 2026Architecture:Transformer Warm

THU-KEG/DeepDive-4B-SFT is a 4 billion parameter instruction-tuned model developed by THU-KEG, specifically fine-tuned for deep search agents. This model is designed to enhance robust reinforcement learning by incorporating citation-aware rubric rewards, as detailed in the associated research paper. It specializes in tasks requiring evidence chaining and advanced information retrieval, offering a 32768 token context length for complex queries.

Loading preview...

THU-KEG/DeepDive-4B-SFT Overview

THU-KEG/DeepDive-4B-SFT is a 4 billion parameter instruction-tuned model developed by THU-KEG, primarily designed to support advanced deep search agents. This model is a key component of the research presented in the paper "Chaining the Evidence: Robust Reinforcement Learning for Deep Search Agents with Citation-Aware Rubric Rewards." Its core innovation lies in its fine-tuning for robust reinforcement learning, specifically by leveraging citation-aware rubric rewards to improve agent performance and reliability.

Key Capabilities

  • Enhanced Deep Search: Optimized for tasks requiring agents to perform in-depth information retrieval and evidence chaining.
  • Citation-Aware Rewards: Integrates a novel reward mechanism that considers citation quality and relevance, leading to more robust learning.
  • Reinforcement Learning Integration: Designed to be a foundational component for developing sophisticated RL-based search agents.
  • Large Context Window: Features a 32768-token context length, enabling the processing of extensive search results and complex queries.

Good For

  • Researchers and developers working on advanced search agents and information retrieval systems.
  • Applications requiring robust evidence-based reasoning and citation analysis.
  • Experiments in reinforcement learning for complex, knowledge-intensive tasks.
  • Projects that benefit from a model specifically trained to understand and utilize contextual evidence from diverse sources.