amd/PARD2-Qwen3-14B

TEXT GENERATION | Concurrency Cost: 1 | Model Size: 0.8B | Quant: BF16 | Ctx Length: 32k | Tool Calling: Supported | Published: Jun 9, 2026 | License: mit | Architecture: Transformer | Open Weights | Cold

amd/PARD2-Qwen3-14B is a 0.8 billion parameter parallel draft model developed by AMD for dual-mode speculative decoding. It is based on the Qwen3 architecture and is trained to maximize consecutive token acceptance rather than next-token prediction accuracy alone. By aligning draft-model training with the inference-time objective of speculative decoding, it achieves up to 6.94x lossless acceleration of large language model inference. Its primary use case is raising throughput and reducing latency of LLM inference across a range of batch sizes.


PARD-2: Target-Aligned Parallel Draft Model for Dual-Mode Speculative Decoding

PARD-2 is a speculative decoding method developed by AMD that builds on the original PARD framework. Its draft models, including amd/PARD2-Qwen3-14B, are trained for the inference-time objective of speculative decoding: maximizing the number of consecutive draft tokens that the target model accepts, rather than token-level prediction accuracy alone. Longer accepted runs mean fewer target-model passes per generated token, which makes large language model inference more efficient.
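To make the draft-then-verify process concrete, here is a minimal toy sketch of greedy speculative decoding in Python. It is not AMD's implementation: `target_next` and `draft_next` are hypothetical stand-ins for real models, and the toy draft proposes tokens one at a time for simplicity, whereas a parallel draft model would produce them in a single pass. The sketch only illustrates why greedy speculative decoding is lossless (the output equals plain target decoding) while needing fewer target passes.

```python
from typing import Callable, List, Tuple

NextFn = Callable[[List[int]], int]


def target_next(seq: List[int]) -> int:
    """Hypothetical toy 'target model': a deterministic next-token rule."""
    return (sum(seq) * 31 + len(seq) * 7) % 50


def draft_next(seq: List[int]) -> int:
    """Hypothetical toy 'draft model': agrees with the target except
    when the context length is a multiple of 4."""
    tok = target_next(seq)
    return (tok + 1) % 50 if len(seq) % 4 == 0 else tok


def greedy_decode(next_fn: NextFn, prompt: List[int], n_new: int) -> List[int]:
    """Plain autoregressive greedy decoding, one token per model call."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(next_fn(seq))
    return seq


def speculative_decode(
    target: NextFn, draft: NextFn, prompt: List[int], n_new: int, k: int = 4
) -> Tuple[List[int], int]:
    """Greedy speculative decoding; returns (sequence, verification passes)."""
    seq = list(prompt)
    goal = len(prompt) + n_new
    target_passes = 0
    while len(seq) < goal:
        # Draft phase: propose k tokens.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(seq + proposal))

        # Verify phase: a real target scores all k positions in one
        # forward pass; here that pass is simulated position by position.
        target_passes += 1
        accepted: List[int] = []
        for tok in proposal:
            expected = target(seq + accepted)
            if tok == expected:
                accepted.append(tok)       # draft token accepted
            else:
                accepted.append(expected)  # first mismatch: take target's token
                break
        else:
            # Every draft token matched, so the target adds one bonus token.
            accepted.append(target(seq + accepted))
        seq.extend(accepted)
    return seq[:goal], target_passes


prompt = [1, 2, 3]
spec_out, passes = speculative_decode(target_next, draft_next, prompt, n_new=20)
baseline = greedy_decode(target_next, prompt, n_new=20)
print(spec_out == baseline, passes)
```

In this toy setting the speculative output matches the baseline token for token, and the number of verification passes is well below the 20 calls that plain decoding needs. The longer the accepted run per pass, the larger that gap.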

Key Capabilities

  • Target-Aligned Optimization: Reformulates the draft-model objective from next-token prediction to acceptance-length optimization, better matching the draft-then-verify process of speculative decoding.
  • Confidence-Adaptive Token (CAT) Optimization: Introduces adaptive reweighting of tokens based on their contribution to the verification process, improving alignment between draft generation and target-model acceptance.
  • Dual-Mode Speculative Decoding: A single PARD-2 draft model supports both target-independent and target-dependent modes, offering deployment flexibility with strong alignment capabilities.
  • High Performance: Achieves up to 6.94x lossless acceleration across diverse models and tasks. For instance, on LLaMA3.1-8B, PARD-2 surpasses EAGLE-3 by 1.9x and PARD by 1.3x.
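The difference between token-level accuracy and acceptance length can be illustrated with a small Python sketch. The token sequences and helper names below are made up for illustration and are not taken from PARD-2; the sketch assumes greedy verification, where drafted tokens are accepted only up to the first mismatch.

```python
from typing import Sequence


def token_accuracy(draft: Sequence[int], target: Sequence[int]) -> float:
    """Fraction of positions where the draft token equals the target token."""
    return sum(d == t for d, t in zip(draft, target)) / len(target)


def acceptance_length(draft: Sequence[int], target: Sequence[int]) -> int:
    """Number of consecutive draft tokens accepted before the first mismatch."""
    n = 0
    for d, t in zip(draft, target):
        if d != t:
            break
        n += 1
    return n


# Hypothetical token IDs: what the target model would emit, and two drafts.
target = [5, 8, 2, 9, 4, 7, 1, 3]
draft_a = [0, 8, 2, 9, 4, 7, 1, 3]  # single miss at the first position
draft_b = [5, 8, 2, 9, 4, 7, 1, 0]  # single miss at the last position

for name, draft in (("draft_a", draft_a), ("draft_b", draft_b)):
    print(name, token_accuracy(draft, target), acceptance_length(draft, target))
# draft_a 0.875 0
# draft_b 0.875 7
```

Both drafts have the same token accuracy, yet only the second yields a long accepted run. This is the gap that an acceptance-length objective targets: early mistakes cost far more than late ones.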

Good for

  • Accelerating the inference speed and improving throughput of large language models.
  • Reducing latency in LLM applications, especially across various batch sizes (from 1 to 64).
  • Giving developers an efficient speculative decoding solution that is optimized for acceptance length rather than token prediction accuracy alone.
  • Research and development in high-performance LLM serving and deployment.