amd/PARD2-Llama-3.1-8B
amd/PARD2-Llama-3.1-8B is a 1-billion-parameter draft model developed by AMD for dual-mode speculative decoding, designed to accelerate large language model inference. It introduces Target-Aligned Parallel Draft Model optimization and Confidence-Adaptive Token (CAT) optimization to maximize consecutive token acceptance during speculative decoding. PARD-2 achieves up to 6.94x lossless acceleration and, on LLaMA3.1-8B, outperforms previous speculative decoding methods such as EAGLE-3 and PARD.
PARD-2: Target-Aligned Parallel Draft Model
PARD-2 is AMD's speculative decoding method for accelerating large language model (LLM) inference. Its draft model for Llama-3.1-8B, amd/PARD2-Llama-3.1-8B, has 1 billion parameters and is optimized for dual-mode speculative decoding. It is trained to maximize the number of consecutive draft tokens the target model accepts, rather than next-token prediction accuracy alone.
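To make "consecutive token acceptance" concrete, here is a minimal sketch of one greedy draft-and-verify step. It is not AMD's implementation: `count_accepted`, `speculative_step`, and `toy_target` are hypothetical names, the target model is a stand-in function, and a real system would verify all draft tokens in a single batched forward pass rather than one call per position.

```python
from typing import Callable, List

Token = int
# Hypothetical stand-in for a target model: maps a token prefix to its greedy next token.
NextTokenFn = Callable[[List[Token]], Token]


def count_accepted(context: List[Token], draft_tokens: List[Token],
                   target_next: NextTokenFn) -> int:
    """Count the leading draft tokens that match the target's greedy choice."""
    prefix = list(context)
    accepted = 0
    for tok in draft_tokens:
        if target_next(prefix) != tok:
            break  # first mismatch: every later draft token is discarded
        prefix.append(tok)
        accepted += 1
    return accepted


def speculative_step(context: List[Token], draft_tokens: List[Token],
                     target_next: NextTokenFn) -> List[Token]:
    """Return the tokens emitted by one draft-and-verify step."""
    n = count_accepted(context, draft_tokens, target_next)
    prefix = list(context) + list(draft_tokens[:n])
    # The target always contributes one token: a correction at the first
    # mismatch, or a bonus token when every draft token was accepted.
    return list(draft_tokens[:n]) + [target_next(prefix)]


def toy_target(prefix: List[Token]) -> Token:
    """Toy target (assumption for illustration): always continues with last + 1."""
    return prefix[-1] + 1


if __name__ == "__main__":
    # Draft gets positions 1-2 right, position 3 wrong, position 4 "right".
    print(speculative_step([1, 2, 3], [4, 5, 9, 7], toy_target))
```

In this toy run the fourth draft token would have matched the true continuation, but it is thrown away because the third was rejected. That is why a draft model is better judged by the length of its accepted prefix than by per-token accuracy.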
Key Innovations
- Target-Aligned Optimization: Reformulates the draft-model objective to prioritize acceptance-length optimization, directly aligning with the speculative decoding process.
- Confidence-Adaptive Token (CAT) Optimization: Adaptively reweights tokens based on their contribution to the verification process, enhancing alignment between draft generation and target-model acceptance.
- Dual-Mode Speculative Decoding: Supports both target-independent and target-dependent modes, offering deployment flexibility and strong alignment capabilities.
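The difference between acceptance length and per-token accuracy can be illustrated with a short calculation. The sketch below is not the PARD-2 training objective or the CAT weighting, neither of which is specified here. It only computes the expected number of accepted draft tokens from per-position acceptance probabilities, where each probability is conditional on all earlier positions having been accepted. The function name and the example probabilities are hypothetical.

```python
from typing import Sequence


def expected_accept_length(accept_probs: Sequence[float]) -> float:
    """Expected length of the accepted draft prefix.

    accept_probs[i] is the probability that draft position i is accepted,
    given that positions 0..i-1 were all accepted. The expectation is the
    sum over k of P(at least k tokens accepted).
    """
    expected = 0.0
    survive = 1.0  # probability that every position so far was accepted
    for p in accept_probs:
        if not 0.0 <= p <= 1.0:
            raise ValueError("acceptance probabilities must lie in [0, 1]")
        survive *= p
        expected += survive
    return expected


if __name__ == "__main__":
    # Two hypothetical drafts with the same mean per-position accuracy (0.8).
    weak_start = [0.6, 1.0, 0.8, 0.8]
    strong_start = [1.0, 0.8, 0.8, 0.6]
    print(expected_accept_length(weak_start))    # roughly 2.06
    print(expected_accept_length(strong_start))  # roughly 2.82
```

Both drafts have the same average accuracy, yet the one that is stronger at early positions yields a longer expected accepted prefix, because an early rejection discards everything after it. An objective aligned with acceptance length therefore cannot treat all token positions as equally important.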
Performance
PARD-2 achieves state-of-the-art performance, delivering up to 6.94x lossless acceleration across the models and tasks evaluated. On LLaMA3.1-8B, PARD-2 surpasses EAGLE-3 by 1.9x and the original PARD by 1.3x.
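As a rough way to read speedup figures like these, the sketch below uses a simplified cost model that is an assumption of this example, not the benchmark methodology behind the numbers above. It assumes each step costs one target forward pass plus one draft pass, that verifying several draft tokens costs about the same as generating one token with the target, and that each step emits the accepted draft tokens plus one token from the target. The function name and the input values are hypothetical.

```python
def estimated_speedup(mean_accepted: float, draft_cost_ratio: float) -> float:
    """Estimate speedup over plain autoregressive decoding.

    mean_accepted:    average number of draft tokens accepted per step.
    draft_cost_ratio: cost of one draft pass divided by the cost of one
                      target forward pass.
    Baseline decoding emits 1 token per unit cost; speculative decoding
    emits (mean_accepted + 1) tokens per (1 + draft_cost_ratio) units.
    """
    if mean_accepted < 0 or draft_cost_ratio < 0:
        raise ValueError("inputs must be non-negative")
    tokens_per_step = mean_accepted + 1.0
    cost_per_step = 1.0 + draft_cost_ratio
    return tokens_per_step / cost_per_step


if __name__ == "__main__":
    # Illustrative values only, not measurements of PARD-2.
    print(estimated_speedup(mean_accepted=4.0, draft_cost_ratio=0.25))
```

Under this model, speedup grows with the accepted prefix length and shrinks with draft cost, which is consistent with using a small draft model that proposes several tokens in parallel. Real measurements also depend on batch size, hardware, and framework overhead, which this estimate ignores.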