amd/PARD-Qwen2.5-0.5B

Task: Text generation | Concurrency cost: 1 | Model size: 0.5B | Quantization: BF16 | Context length: 32k | Published: May 17, 2025 | License: MIT | Architecture: Transformer | Open weights | Warm

amd/PARD-Qwen2.5-0.5B is a 0.5-billion-parameter parallel draft model developed by AMD and based on Qwen2.5. It acts as the draft model in speculative decoding and is produced by PARD, a low-cost method for adapting an autoregressive draft model into a parallel one. The result is faster token generation than standard autoregressive decoding across various target LLMs.


PARD: Accelerating LLM Inference

PARD (Parallel Draft Model Adaptation) is a speculative decoding method that delivers substantial inference speedups with minimal adaptation overhead. amd/PARD-Qwen2.5-0.5B is the 0.5-billion-parameter, Qwen2.5-based parallel draft model that AMD built with it: the small draft model proposes several tokens ahead, and the larger target LLM verifies them.
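To make the draft-and-verify idea concrete, here is a minimal toy sketch of greedy speculative decoding in plain Python. It is not PARD's implementation and uses no real model: `toy_target`, `toy_draft`, and `speculative_decode` are hypothetical names, and the "models" are simple functions over integer tokens. It also drafts one token at a time for simplicity, whereas a parallel draft model is meant to produce its proposed tokens in parallel.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical stand-in for a model: maps a token prefix to the next
# token under greedy decoding.
NextToken = Callable[[Sequence[int]], int]


def speculative_decode(target: NextToken, draft: NextToken,
                       prompt: List[int], max_new_tokens: int,
                       k: int = 4) -> Tuple[List[int], int]:
    """Greedy speculative decoding; returns (tokens, verification rounds)."""
    tokens = list(prompt)
    target_passes = 0
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1) The draft model proposes k tokens ahead.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(tokens + proposal))

        # 2) The target checks every proposed position. A real system
        #    does this in one batched forward pass, so we count it once;
        #    this toy simply calls the target function per position.
        target_passes += 1
        accepted: List[int] = []
        for i in range(k + 1):
            expected = target(tokens + accepted)
            accepted.append(expected)
            # Stop at the first mismatch (keeping the target's token),
            # or after the extra token gained when all k drafts match.
            if i == k or proposal[i] != expected:
                break
        tokens.extend(accepted)
    return tokens[:len(prompt) + max_new_tokens], target_passes


def toy_target(prefix: Sequence[int]) -> int:
    # Toy "large model": counts upward modulo 10.
    return (prefix[-1] + 1) % 10


def toy_draft(prefix: Sequence[int]) -> int:
    # Toy "draft model": agrees with the target except when the
    # correct next token would be 7.
    nxt = (prefix[-1] + 1) % 10
    return 0 if nxt == 7 else nxt


if __name__ == "__main__":
    out, passes = speculative_decode(toy_target, toy_draft, [0], 12, k=4)
    print(out, passes)
```

Because every accepted token is checked against the target, the output matches what the target would generate on its own under greedy decoding; the gain comes from needing fewer target verification rounds than generated tokens whenever the draft guesses well.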

Key Capabilities & Advantages

  • Low-Cost Training: PARD efficiently adapts autoregressive (AR) draft models into parallel draft models, achieving an average inference speedup of 1.78× over pure AR draft models. It incorporates a conditional drop-token strategy to improve training efficiency by up to 3×.
  • Generalizability: Unlike target-dependent approaches, a single PARD draft model can accelerate an entire family of target models due to its target-independent design. This significantly reduces deployment complexity and adaptation costs.
  • High Performance: When integrated into optimized inference frameworks, PARD achieves up to a 4.08× speedup with Transformers+ and up to a 3.06× speedup in vLLM, where it outperforms other speculative decoding methods by 1.51×.

Good for

  • Accelerating LLM Inference: Suited to developers who want faster token generation from a range of target LLMs.
  • Reducing Deployment Complexity: Suitable for scenarios where a single draft model needs to accelerate multiple target models without extensive retraining.
  • Cost-Effective Adaptation: Beneficial for projects requiring high inference performance with minimal training and adaptation overhead.