PARD-2: Target-Aligned Parallel Draft Model for Dual-Mode Speculative Decoding

PARD-2 is a speculative decoding method developed by AMD that builds on the original PARD framework. PARD-2 models, including the amd/PARD2-Qwen3-14B variant, optimize the draft model for the inference-time objective of speculative decoding, which is to maximize the number of consecutive draft tokens the target model accepts, rather than for token-level prediction accuracy alone. Longer accepted runs mean fewer target-model forward passes per generated token, which makes large language model inference faster.

Key Capabilities

Target-Aligned Optimization: Reformulates the draft-model objective from next-token prediction to acceptance-length optimization, better matching the draft-then-verify process of speculative decoding.
Confidence-Adaptive Token (CAT) Optimization: Introduces adaptive reweighting of tokens based on their contribution to the verification process, improving alignment between draft generation and target-model acceptance.
Dual-Mode Speculative Decoding: A single PARD-2 draft model supports both target-independent and target-dependent modes, offering deployment flexibility with strong alignment capabilities.
High Performance: Achieves up to 6.94x lossless acceleration across diverse models and tasks. For instance, on LLaMA3.1-8B, PARD-2 surpasses EAGLE-3 by 1.9x and PARD by 1.3x.
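
The difference between token-level accuracy and acceptance length can be illustrated with a minimal sketch. It assumes greedy verification, where a draft token is accepted only if it matches the target model's token at the same position and everything after the first mismatch is discarded. The function names and token IDs below are hypothetical and are not taken from the PARD-2 code base.

```python
from typing import Sequence


def acceptance_length(draft: Sequence[int], target: Sequence[int]) -> int:
    """Count the leading draft tokens that match the target's tokens.

    Assumes greedy verification: the first mismatch ends the accepted run.
    """
    accepted = 0
    for d, t in zip(draft, target):
        if d != t:
            break
        accepted += 1
    return accepted


def token_accuracy(draft: Sequence[int], target: Sequence[int]) -> float:
    """Fraction of positions where draft and target agree, ignoring order."""
    pairs = list(zip(draft, target))
    if not pairs:
        return 0.0
    return sum(d == t for d, t in pairs) / len(pairs)


# Hypothetical token IDs: both drafts get 3 of 4 positions right,
# but the position of the single error changes how many tokens survive.
target = [5, 1, 2, 7]
early_miss = [5, 9, 2, 7]  # error at position 1
late_miss = [5, 1, 2, 4]   # error at position 3

print(token_accuracy(early_miss, target), acceptance_length(early_miss, target))
print(token_accuracy(late_miss, target), acceptance_length(late_miss, target))
```

Both drafts have the same token-level accuracy, yet the one whose error comes later yields a longer accepted run. This is the gap that an acceptance-length objective is meant to close: an early mistake costs more than a late one, so the tokens do not all deserve equal weight.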

Good for

Accelerating inference and improving throughput of large language models.
Reducing latency in LLM applications across batch sizes from 1 to 64.
Developers looking for an efficient speculative decoding solution that is optimized for acceptance length rather than just token prediction accuracy.
Research and development in high-performance LLM serving and deployment.
