CalamitousFelicitousness/Qwen2.5-32B-Instruct-fp8-dynamic

Text Generation · Concurrency Cost: 2 · Model Size: 32.8B · Quant: FP8 · Ctx Length: 32k · Published: Sep 18, 2024 · License: apache-2.0 · Architecture: Transformer · Open Weights

Qwen2.5-32B-Instruct is a 32.5-billion-parameter instruction-tuned causal language model from the Qwen team, built on the Qwen2 architecture. It brings significant improvements in coding, mathematics, instruction following, and long-text generation (up to 8K output tokens), and supports a full context length of 131,072 tokens. The model targets robust performance across diverse tasks, including structured-data understanding and multilingual support covering over 29 languages.


Qwen2.5-32B-Instruct Overview

Qwen2.5-32B-Instruct is an instruction-tuned model from the latest Qwen2.5 series, developed by Qwen. This 32.5-billion-parameter causal language model builds on the Qwen2 architecture, incorporating enhancements that improve performance across a range of domains. It supports a context length of 131,072 tokens and can generate outputs of up to 8,192 tokens.

Key Capabilities & Improvements

  • Enhanced Knowledge & Reasoning: Significantly improved capabilities in coding and mathematics due to specialized expert models.
  • Instruction Following: Demonstrates substantial improvements in adhering to instructions and generating structured outputs, including JSON.
  • Long Text Handling: Excels at generating long texts (over 8K tokens) and understanding structured data like tables.
  • System Prompt Resilience: More robust to diverse system prompts, enhancing role-play and chatbot condition-setting.
  • Multilingual Support: Provides comprehensive support for over 29 languages, including major global languages.
  • Architecture: Utilizes transformers with RoPE, SwiGLU, RMSNorm, and Attention QKV bias.
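The RoPE component mentioned in the architecture list can be illustrated with a minimal sketch. This is a pure-Python, interleaved-pair formulation for a single vector, written for clarity rather than fidelity to the actual transformers implementation (which operates on batched tensors and uses a split-half channel layout):

```python
import math

def rope(vec, pos, base=10000.0):
    """Apply rotary position embedding to one query/key vector (even length).

    Each adjacent pair of channels is rotated by an angle that shrinks
    geometrically with the channel index and grows linearly with position.
    """
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos / base ** (i / d)
        c, s = math.cos(theta), math.sin(theta)
        x1, x2 = vec[i], vec[i + 1]
        out += [x1 * c - x2 * s, x1 * s + x2 * c]
    return out
```

Because every pair is only rotated, vector norms are preserved, and the dot product between a rotated query at position m and a rotated key at position n depends only on the offset n - m, which is what lets attention scores encode relative position.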

Long Context Processing

The model's config.json caps the context at 32,768 tokens by default, but the model can handle up to 131,072 tokens when YaRN length extrapolation is enabled. For long-context deployment, vLLM is recommended; note, however, that vLLM currently supports only static YaRN, meaning the scaling factor is applied regardless of input length, which can degrade performance on shorter texts. It is therefore advisable to enable rope_scaling only when long contexts are actually needed.
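As described in Qwen's documentation, YaRN is enabled by adding a rope_scaling entry to config.json. The snippet below is a sketch of that entry; the factor of 4.0 follows from the ratio 131,072 / 32,768:

```json
{
  "rope_scaling": {
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
    "type": "yarn"
  }
}
```

With static YaRN this factor is applied to every request, which is why the entry should be left out when typical inputs fit comfortably within the native 32K window.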