PARD-2: Target-Aligned Parallel Draft Model

AMD's PARD-2 is a speculative decoding method for accelerating large language model (LLM) inference. The draft model described here, amd/PARD2-Llama-3.1-8B, has 1 billion parameters and is optimized for dual-mode speculative decoding. It is trained to maximize the number of consecutive draft tokens that the target model accepts, rather than next-token prediction accuracy alone.

Key Innovations

Target-Aligned Optimization: Reformulates the draft-model objective to prioritize acceptance-length optimization, directly aligning with the speculative decoding process.
Confidence-Adaptive Token (CAT) Optimization: Adaptively reweights tokens based on their contribution to the verification process, enhancing alignment between draft generation and target-model acceptance.
Dual-Mode Speculative Decoding: Supports both target-independent and target-dependent modes, offering deployment flexibility and strong alignment capabilities.
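
To make "acceptance length" concrete, the following is a minimal sketch of one greedy speculative decoding step. It is not PARD-2's implementation: the toy next-token functions, the names `speculative_step` and `accepted_prefix_length`, and the sequential drafting loop are all assumptions made for illustration (a parallel draft model would propose its tokens in a single forward pass, and a real target model would verify them in one batched pass).

```python
from typing import Callable, List, Sequence

Token = int
NextTokenFn = Callable[[Sequence[Token]], Token]


def accepted_prefix_length(draft: Sequence[Token], target: Sequence[Token]) -> int:
    """Count the leading draft tokens that match the target's greedy choices."""
    n = 0
    for d, t in zip(draft, target):
        if d != t:
            break
        n += 1
    return n


def speculative_step(
    prefix: Sequence[Token],
    draft_next: NextTokenFn,
    target_next: NextTokenFn,
    k: int,
) -> List[Token]:
    """Run one draft-then-verify step and return the tokens it commits."""
    # Draft proposes k tokens (sequentially here, purely for illustration).
    ctx = list(prefix)
    draft: List[Token] = []
    for _ in range(k):
        tok = draft_next(ctx)
        draft.append(tok)
        ctx.append(tok)

    # Target's greedy token at each of the k + 1 positions.
    target = [target_next(list(prefix) + draft[:i]) for i in range(k + 1)]

    # Keep the matching prefix, then take one token from the target:
    # a correction at the first mismatch, or a bonus token if all k matched.
    n = accepted_prefix_length(draft, target)
    return draft[:n] + [target[n]]


# Hypothetical toy models: the target counts upward modulo 10,
# and the draft agrees except right after the token 4.
def target_next(ctx: Sequence[Token]) -> Token:
    return (ctx[-1] + 1) % 10


def draft_next(ctx: Sequence[Token]) -> Token:
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


committed = speculative_step([1], draft_next, target_next, k=4)
print(committed)  # [2, 3, 4, 5]: three accepted draft tokens plus one correction
```

In this sketch every committed token is one the target would have chosen under greedy decoding, which is why the acceleration is lossless. Each step costs one target verification, so the longer the accepted prefix, the more tokens each verification yields. That is the quantity a target-aligned draft objective aims to increase.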

Performance

PARD-2 achieves state-of-the-art performance, delivering up to 6.94x lossless acceleration across various models and tasks. On Llama-3.1-8B, PARD-2 surpasses EAGLE-3 by 1.9x and the original PARD by 1.3x.
