PARD-2: Target-Aligned Parallel Draft Model for Dual-Mode Speculative Decoding

PARD-2, from AMD, is a speculative decoding method that speeds up large language model inference. Building on the original PARD, it introduces a Target-Aligned Parallel Draft Model designed for dual-mode speculative decoding. Traditional draft models are trained only for token-level prediction accuracy. PARD-2 instead aligns its training objective with the inference-time goal of maximizing the number of consecutive draft tokens that the target model accepts, which is what determines the speedup.
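
The gap between token-level accuracy and consecutive acceptance can be seen in a small sketch. It assumes greedy verification, where a draft token is accepted only if it matches the target model's token at the same position and verification stops at the first mismatch. The function names and token IDs are hypothetical and purely illustrative; this is not PARD-2's implementation.

```python
def token_accuracy(draft_tokens, target_tokens):
    """Fraction of positions where the draft matches the target."""
    matches = sum(d == t for d, t in zip(draft_tokens, target_tokens))
    return matches / len(target_tokens)


def accepted_prefix_length(draft_tokens, target_tokens):
    """Number of draft tokens accepted under greedy verification.

    Assumption: acceptance stops at the first mismatch, so only the
    matching prefix counts.
    """
    accepted = 0
    for d, t in zip(draft_tokens, target_tokens):
        if d != t:
            break
        accepted += 1
    return accepted


# Hypothetical token IDs, chosen only for illustration.
target = [11, 42, 7, 93, 5]
draft_a = [11, 42, 7, 0, 0]   # 3 of 5 correct, all at the front
draft_b = [0, 42, 7, 93, 5]   # 4 of 5 correct, but the first is wrong

for name, draft in [("draft_a", draft_a), ("draft_b", draft_b)]:
    print(name,
          "accuracy:", token_accuracy(draft, target),
          "accepted:", accepted_prefix_length(draft, target))
```

Here `draft_b` has the higher token accuracy but contributes no accepted tokens, while `draft_a` contributes three. An objective that counts every position equally would prefer `draft_b`; one aligned with acceptance would prefer `draft_a`.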

Key Capabilities

Target-Aligned Optimization: Reformulates the draft-model objective from next-token prediction to acceptance-length optimization, better matching the speculative decoding process.
Confidence-Adaptive Token (CAT) Optimization: Adaptively reweights tokens based on their contribution to the verification process, improving alignment between draft generation and target-model acceptance.
Dual-Mode Speculative Decoding: A single PARD-2 draft model supports both target-independent and target-dependent modes, combining deployment flexibility with enhanced alignment.
State-of-the-Art Performance: Achieves up to 6.94x lossless acceleration across diverse models and tasks. For instance, on LLaMA3.1-8B, PARD-2 surpasses EAGLE-3 by 1.9x and PARD by 1.3x.
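
To give a rough sense of why acceptance length differs from per-token accuracy as an objective, the sketch below computes the expected number of accepted tokens for a draft of K tokens. It assumes each position i is accepted with an independent probability p_i, so the expected length is the sum over k of the product p_1 * ... * p_k. The independence assumption, the function name, and the probabilities are all illustrative; this is not the CAT weighting used by PARD-2.

```python
def expected_accept_length(accept_probs):
    """Expected accepted prefix length under independent acceptance.

    Assumption: position k is reached only if all earlier positions
    were accepted, so its contribution is the running product.
    """
    expected = 0.0
    prefix_prob = 1.0
    for p in accept_probs:
        prefix_prob *= p          # probability that the first k tokens all pass
        expected += prefix_prob
    return expected


# Hypothetical per-position acceptance probabilities.
baseline = [0.8, 0.8, 0.8, 0.8]
better_first = [0.9, 0.8, 0.8, 0.8]   # same improvement, applied at position 1
better_last = [0.8, 0.8, 0.8, 0.9]    # same improvement, applied at position 4

gain_first = expected_accept_length(better_first) - expected_accept_length(baseline)
gain_last = expected_accept_length(better_last) - expected_accept_length(baseline)
print("gain from improving the first position:", round(gain_first, 4))
print("gain from improving the last position:", round(gain_last, 4))
```

Under these assumptions, the same improvement is worth more at an early position than at a late one, because every later token depends on the earlier ones being accepted. This is one intuition for weighting tokens by their contribution to verification rather than treating all positions equally.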

Good for

Accelerating inference throughput for large language models.
Optimizing speculative decoding performance with improved token acceptance rates.
Providing developers with a flexible, efficient draft model for LLM deployment.
