Tiiny/SmallThinker-3B-Preview

Text Generation
Concurrency Cost: 1
Model Size: 3.1B
Quant: BF16
Ctx Length: 32k
Tool Calling: Supported
Published: Dec 12, 2024
Architecture: Transformer

SmallThinker-3B-Preview is a 3.1-billion-parameter instruction-tuned causal language model developed by Tiiny, fine-tuned from Qwen2.5-3B-Instruct, with a 32,768-token context length. It shows stronger mathematical and reasoning performance than its base model and outperforms GPT-4o on several benchmarks, such as AIME24 and GAOKAO2024. The model is aimed primarily at efficient edge deployment on resource-constrained devices and can also serve as a fast draft model for larger language models.

SmallThinker-3B-Preview Overview

SmallThinker-3B-Preview is a 3.1-billion-parameter language model fine-tuned from Qwen2.5-3B-Instruct. Its 32,768-token context length makes it suitable for processing longer inputs. The model was trained with a two-phase Supervised Fine-Tuning (SFT) process on 8 H100 GPUs, using datasets such as PowerInfer/QWQ-LONGCOT-500K and PowerInfer/LONGCOT-Refine.

Key Capabilities & Performance

SmallThinker-3B-Preview improves markedly on its base model, Qwen2.5-3B-Instruct, in mathematical and reasoning tasks, and surpasses GPT-4o on specific benchmarks. For instance, it scores 16.667 on AIME24 (vs. 6.67 for Qwen2.5-3B-Instruct and 9.3 for GPT-4o) and 68.2 on MMLU_STEM (vs. 59.8 for Qwen2.5-3B-Instruct and 64.2 for GPT-4o). It also scores 70 on AMPS_Hard and 46.8 on math_comp, indicating strong performance in complex problem-solving.
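To make the size of these gaps easier to read, the short Python sketch below computes the absolute point margins from the figures quoted above. Only the two benchmarks with comparison scores in this text are included; the `SCORES` table and the `margins` helper are illustrative names, not part of any published tooling.

```python
# Scores copied from the paragraph above. Only benchmarks for which the text
# gives comparison figures are included.
SCORES = {
    "AIME24": {
        "SmallThinker-3B-Preview": 16.667,
        "Qwen2.5-3B-Instruct": 6.67,
        "GPT-4o": 9.3,
    },
    "MMLU_STEM": {
        "SmallThinker-3B-Preview": 68.2,
        "Qwen2.5-3B-Instruct": 59.8,
        "GPT-4o": 64.2,
    },
}


def margins(benchmark, model="SmallThinker-3B-Preview"):
    """Return {other_model: model_score - other_score} in absolute points."""
    row = SCORES[benchmark]
    return {
        name: round(row[model] - score, 3)
        for name, score in row.items()
        if name != model
    }


for benchmark in SCORES:
    print(benchmark, margins(benchmark))
```

On these figures the margin over the base model is roughly 10 points on AIME24 and 8.4 points on MMLU_STEM.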

Ideal Use Cases

  • Edge Deployment: Its compact size makes it highly efficient for deployment on devices with limited computational resources.
  • Draft Model: SmallThinker can serve as a fast draft model for larger language models, such as QwQ-32B-Preview, with a reported speedup of about 70% in llama.cpp.
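
The draft-model use case rests on speculative decoding: a small model proposes several tokens cheaply, and the large model verifies them in one pass. The Python sketch below illustrates the greedy variant of that idea with toy stand-in models. It is not llama.cpp's implementation, and `speculative_decode`, `toy_target`, and `toy_draft` are hypothetical names introduced only for illustration. Real engines typically also emit a bonus token when every proposal is accepted, which is omitted here for brevity.

```python
from typing import Callable, List, Tuple

# A "model" here is any function mapping a token prefix to the next token.
NextToken = Callable[[List[str]], str]


def speculative_decode(
    target: NextToken,
    draft: NextToken,
    prompt: List[str],
    max_new_tokens: int,
    lookahead: int = 4,
) -> Tuple[List[str], int]:
    """Greedy draft-and-verify loop.

    Returns the generated tokens and the number of verification passes
    charged to the target model.
    """
    tokens = list(prompt)
    produced = 0
    target_passes = 0
    while produced < max_new_tokens:
        # 1. The draft model proposes up to `lookahead` tokens.
        k = min(lookahead, max_new_tokens - produced)
        proposal: List[str] = []
        for _ in range(k):
            proposal.append(draft(tokens + proposal))

        # 2. The target model verifies the proposal. A real engine scores all
        #    positions in one batched forward pass; this toy counts one pass
        #    and queries the target position by position.
        target_passes += 1
        accepted: List[str] = []
        for tok in proposal:
            expected = target(tokens + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                # First disagreement: keep the target's token, drop the rest.
                accepted.append(expected)
                break

        tokens.extend(accepted)
        produced += len(accepted)
    return tokens[len(prompt):], target_passes


# Toy stand-ins: the target recites a fixed sentence and the draft agrees
# with it except at every fourth position.
TEXT = "the quick brown fox jumps over the lazy dog".split()


def toy_target(prefix: List[str]) -> str:
    return TEXT[len(prefix) % len(TEXT)]


def toy_draft(prefix: List[str]) -> str:
    return "???" if len(prefix) % 4 == 3 else toy_target(prefix)


output, passes = speculative_decode(toy_target, toy_draft, [], max_new_tokens=9)
print(" ".join(output))
print("target passes:", passes, "vs", len(output), "for plain decoding")
```

The output always matches what the target would have produced on its own with greedy decoding. The gain comes from needing fewer target passes, so the achievable speedup depends on how often the draft model agrees with the target.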

Limitations

SmallThinker-3B-Preview currently supports English only. Its reasoning ability is constrained by its size and SFT data, and its outputs can be unpredictable or repetitive, especially on high-difficulty questions. Users may need to adjust repetition_penalty to mitigate repetition.
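
For readers unfamiliar with repetition_penalty, the Python sketch below shows one widely used formulation: the logits of tokens already generated are divided by the penalty when positive and multiplied by it otherwise, so values above 1.0 make repeats less likely. Whether a given inference runtime applies exactly this rule is an assumption, and `apply_repetition_penalty` is a hypothetical helper written for illustration, not a library function.

```python
from typing import Iterable, List


def apply_repetition_penalty(
    logits: List[float],
    seen_token_ids: Iterable[int],
    penalty: float,
) -> List[float]:
    """Return a copy of `logits` with already-seen tokens made less likely.

    penalty == 1.0 leaves the logits unchanged; penalty > 1.0 discourages
    repeating tokens that were already generated.
    """
    if penalty <= 0:
        raise ValueError("penalty must be positive")
    adjusted = list(logits)
    for token_id in set(seen_token_ids):
        value = adjusted[token_id]
        # Dividing a positive logit or multiplying a non-positive one both
        # push the token's score down when penalty > 1.0.
        adjusted[token_id] = value / penalty if value > 0 else value * penalty
    return adjusted


# Toy vocabulary of four tokens; tokens 0 and 1 have already been generated.
logits = [2.0, -1.0, 0.5, 3.0]
print(apply_repetition_penalty(logits, seen_token_ids=[0, 1], penalty=1.25))
```

In this toy example only the two seen tokens change, and the unseen tokens keep their original scores. Values slightly above 1.0 are a common starting point, since a large penalty can also suppress tokens that legitimately need to recur.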