flashresearch/FlashResearch-4B-Thinking

Hugging Face
Text generation · Model size: 4B · Quant: BF16 · Context length: 32k · Published: Oct 1, 2025 · License: MIT · Architecture: Transformer

FlashResearch-4B-Thinking is a 4-billion-parameter Qwen model from flashresearch, distilled from the Tongyi DeepResearch-30B A3B MoE model. It is optimized for web-scale deep-research tasks, including browsing, multi-step reasoning, and source-grounded answers. The model targets efficient inference, particularly when integrated with the Alibaba-NLP/DeepResearch framework, making it suitable for fast, low-cost agent runs.


Overview

FlashResearch-4B-Thinking is a dense 4-billion-parameter Qwen model distilled from the larger Tongyi DeepResearch-30B A3B MoE model. The distillation used 33,000 curated deep-research examples from the flashresearch/FlashResearch-DS-33k dataset. The model is primarily intended for integration with the Alibaba-NLP/DeepResearch framework.
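Outside of the DeepResearch framework, the weights can also be loaded as a standard causal language model. A minimal sketch with Hugging Face `transformers` (the `bfloat16` dtype and `device_map` settings are illustrative assumptions, not values published by the model authors):

```python
# Hedged sketch: loading FlashResearch-4B-Thinking as a plain
# transformers causal LM. Repo IDs are taken from the model card;
# all generation/loading settings here are assumptions.
MODEL_ID = "flashresearch/FlashResearch-4B-Thinking"
DATASET_ID = "flashresearch/FlashResearch-DS-33k"  # distillation data

def load_model():
    """Download and return (tokenizer, model); needs a GPU for BF16."""
    # Imported inside the function so the constants above can be used
    # without pulling in transformers (pip install transformers torch).
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        torch_dtype="bfloat16",   # matches the card's BF16 quant listing
        device_map="auto",        # place layers on available devices
    )
    return tokenizer, model
```

For the agent workflows described below, the DeepResearch repository's own entry points should be preferred over raw `generate()` calls.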

Key Capabilities

  • Web-scale Deep Research: Optimized for tasks requiring extensive information retrieval and synthesis.
  • Multi-step Reasoning: Designed to handle complex queries involving multiple logical steps.
  • Source-Grounded Answers: Focuses on providing responses backed by identified sources.
  • Efficient Inference: Engineered for fast and low-cost operation, suitable for agent-based applications.

Recommended Use

This model is designed to be used directly with the Alibaba-NLP/DeepResearch repository for agent runs. It offers a cost-effective path to deploying deep-research agents: FP16 inference fits on a single 12-16 GB GPU, and quantization lowers the requirement further.
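The hardware claim is easy to sanity-check with a back-of-envelope weight-memory estimate. A small sketch (the 1.2x overhead factor for KV cache and activations is a rough assumption):

```python
# Rough VRAM estimate for a 4B-parameter dense model at different
# weight precisions. Overhead factor is an assumed fudge for KV cache
# and activations, not a measured value.
def weight_vram_gb(params_billion: float, bytes_per_param: float,
                   overhead: float = 1.2) -> float:
    """Approximate GPU memory in GB needed to serve the weights."""
    return params_billion * bytes_per_param * overhead

bf16_gb = weight_vram_gb(4, 2.0)   # BF16/FP16: 2 bytes/weight -> ~9.6 GB
int4_gb = weight_vram_gb(4, 0.5)   # 4-bit quantized -> ~2.4 GB
```

At roughly 9.6 GB for BF16 weights plus runtime overhead, the model lands comfortably inside the 12-16 GB range quoted above, and 4-bit quantization brings it within reach of much smaller GPUs.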