unsloth/Qwen2.5-7B-Instruct-1M

Hugging Face
Text Generation · Concurrency Cost: 1 · Model Size: 7.6B · Quant: FP8 · Ctx Length: 32k · Published: Jan 27, 2025 · License: apache-2.0 · Architecture: Transformer · Open Weights

Qwen2.5-7B-Instruct-1M is a 7.61-billion-parameter instruction-tuned causal language model developed by Qwen, built on a transformer architecture. The model is optimized for ultra-long-context tasks, supporting context lengths of up to 1 million tokens while maintaining strong performance on shorter tasks. It is designed for efficient deployment with a custom vLLM framework that uses sparse attention and length extrapolation to improve both accuracy and speed.


Overview

Qwen2.5-7B-Instruct-1M is a 7.61-billion-parameter instruction-tuned causal language model from the Qwen2.5 series, developed by Qwen. Its primary differentiator is its long-context capability: it supports up to 1 million tokens and significantly outperforms the 128K variant in long-context scenarios while retaining strong performance on short tasks. The model uses a transformer architecture with RoPE, SwiGLU, and RMSNorm.
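Of the architectural components mentioned above, RMSNorm is simple enough to sketch directly. The NumPy version below is illustrative only, not the model's actual implementation:

```python
import numpy as np

def rms_norm(x: np.ndarray, weight: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    # Normalize by the root-mean-square of the activations (no mean
    # subtraction, unlike LayerNorm), then apply a learned per-channel scale.
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return (x / rms) * weight

x = np.array([[3.0, 4.0]])       # RMS = sqrt((9 + 16) / 2) = sqrt(12.5)
out = rms_norm(x, np.ones(2))    # each element divided by ~3.5355
```

Dropping the mean-centering step makes RMSNorm cheaper than LayerNorm while working comparably well in practice, which is why it appears throughout the Qwen2.5 family.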

Key Capabilities

  • Ultra-Long Context Handling: Supports a full context length of 1,010,000 tokens for input and up to 8,192 tokens for generation.
  • Optimized Deployment: Designed for deployment with a custom vLLM framework that incorporates sparse attention and length extrapolation, enabling a 3-7x speedup for sequences up to 1M tokens and improved accuracy for sequences exceeding 256K tokens.
  • Instruction Following: Instruction-tuned for chat and general instruction-following tasks.
  • Efficient Inference: The custom vLLM framework processes long sequences efficiently, with recommendations for specific GPU architectures (Ampere or Hopper) and VRAM requirements (120 GB total for the 7B model at the full 1M-token context).
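A back-of-the-envelope KV-cache estimate helps explain why the VRAM recommendation is so large. The sketch below assumes the published Qwen2.5-7B configuration (28 layers, 4 key/value heads via GQA, head dimension 128) and 16-bit cache entries; these figures are assumptions taken from the public model config, and a real deployment adds weights, activations, and framework overhead on top:

```python
# Back-of-the-envelope KV-cache size for Qwen2.5-7B at 1M-token context.
# Assumed architecture values (from the public model config):
N_LAYERS = 28     # transformer layers
N_KV_HEADS = 4    # key/value heads (grouped-query attention)
HEAD_DIM = 128    # dimension per attention head
BYTES = 2         # bf16/fp16 cache entries

def kv_cache_bytes(num_tokens: int) -> int:
    # Factor of 2 for the separate key and value tensors in every layer.
    return 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * BYTES * num_tokens

per_token = kv_cache_bytes(1)               # 57,344 bytes ≈ 56 KiB per token
total_gb = kv_cache_bytes(1_010_000) / 1e9  # ≈ 57.9 GB for the cache alone
```

Under these assumptions the cache alone approaches 58 GB at full context, before model weights (roughly 15 GB in bf16) and activation memory, which is consistent with the ~120 GB total VRAM guidance.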

Good for

  • Applications requiring extensive context: Ideal for tasks like summarizing very long documents, analyzing large codebases, or processing lengthy conversations.
  • Developers seeking optimized long-context inference: The model's integration with a specialized vLLM framework makes it suitable for those needing high performance and accuracy with ultra-long inputs.
  • Research and development in long-sequence understanding: Provides a robust base for exploring and building applications that push beyond typical context-window limits.
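For the deployment scenarios above, serving typically goes through vLLM's OpenAI-compatible server. Note that the 1M-token model card describes a custom vLLM build with sparse attention; the command below uses only standard vLLM flags and is an illustrative starting point under those assumptions, not the exact recipe from the card:

```shell
# Sketch: serve the model behind an OpenAI-compatible endpoint.
# --max-model-len matches the model's 1,010,000-token context;
# --tensor-parallel-size shards it across 4 GPUs (adjust to your hardware);
# chunked prefill keeps per-step memory bounded on very long prompts.
vllm serve unsloth/Qwen2.5-7B-Instruct-1M \
  --max-model-len 1010000 \
  --tensor-parallel-size 4 \
  --enable-chunked-prefill \
  --max-num-batched-tokens 131072 \
  --enforce-eager
```

Full 1M-token serving requires the GPU architectures and aggregate VRAM noted above; for shorter workloads, lowering `--max-model-len` reduces the KV-cache reservation accordingly.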