amd/PARD2-Qwen3-8B

Text Generation | Concurrency Cost: 1 | Model Size: 0.8B | Quant: BF16 | Ctx Length: 32k | Tool Calling: Supported | Published: Jun 9, 2026 | License: mit | Architecture: Transformer | Open Weights | Cold

amd/PARD2-Qwen3-8B is a Qwen3-8B-based parallel draft model developed by AMD for speculative decoding. It is trained to maximize consecutive token acceptance rather than next-token prediction accuracy alone, using Target-Aligned Optimization and Confidence-Adaptive Token (CAT) Optimization. The model supports dual-mode speculative decoding, which combines deployment flexibility with stronger alignment, and achieves up to 6.94x lossless acceleration of inference throughput.


PARD-2: Target-Aligned Parallel Draft Model for Dual-Mode Speculative Decoding

AMD's PARD-2 is a speculative decoding method for accelerating large language model inference. Building on the original PARD, it introduces a Target-Aligned Parallel Draft Model designed for dual-mode speculative decoding. Unlike traditional draft models, which are trained only for token-level prediction accuracy, PARD-2 aligns its training objective with the inference-time goal of maximizing consecutive token acceptance.
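To see why consecutive acceptance is a different target from token-level accuracy, consider the minimal Python sketch below. It is not AMD's training code or metric; the helper names (`token_accuracy`, `accepted_prefix_length`) and the token IDs are made up for illustration, and verification is assumed to be greedy (a drafted token is accepted only if it equals the target model's token at that position, and everything after the first mismatch is discarded).

```python
from typing import Sequence


def token_accuracy(draft: Sequence[int], target: Sequence[int]) -> float:
    """Fraction of positions where the draft token equals the target token."""
    matches = sum(d == t for d, t in zip(draft, target))
    return matches / len(target)


def accepted_prefix_length(draft: Sequence[int], target: Sequence[int]) -> int:
    """Number of leading draft tokens accepted under greedy verification.

    Acceptance stops at the first mismatch, so later matches do not count.
    """
    accepted = 0
    for d, t in zip(draft, target):
        if d != t:
            break
        accepted += 1
    return accepted


# Hypothetical token IDs: what the target model would emit for 4 positions.
target = [5, 9, 2, 7]
draft_a = [0, 9, 2, 7]  # one error, at the first position
draft_b = [5, 9, 2, 0]  # one error, at the last position

print(token_accuracy(draft_a, target), accepted_prefix_length(draft_a, target))  # 0.75 0
print(token_accuracy(draft_b, target), accepted_prefix_length(draft_b, target))  # 0.75 3
```

Both drafts have the same per-token accuracy, yet only the second one saves any target-model steps. An objective that rewards long accepted prefixes distinguishes the two cases; plain next-token accuracy does not.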

Key Capabilities

  • Target-Aligned Optimization: Reformulates the draft-model objective from next-token prediction to acceptance-length optimization, better matching the speculative decoding process.
  • Confidence-Adaptive Token (CAT) Optimization: Adaptively reweights tokens based on their contribution to the verification process, improving alignment between draft generation and target-model acceptance.
  • Dual-Mode Speculative Decoding: A single PARD-2 draft model supports both target-independent and target-dependent modes, combining deployment flexibility with enhanced alignment.
  • State-of-the-Art Performance: Achieves up to 6.94x lossless acceleration across diverse models and tasks. For instance, on LLaMA3.1-8B, PARD-2 surpasses EAGLE-3 by 1.9x and PARD by 1.3x.
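The word "lossless" in the list above refers to the fact that speculative decoding does not change the target model's output. The toy Python sketch below illustrates this for the greedy case only. It is a generic draft-then-verify loop under stated assumptions, not PARD-2 itself: `target_next` and `draft_next` are hypothetical stand-in functions rather than real models, the draft proposes tokens one at a time here (a parallel draft model would produce them together), and each verification round is counted as a single target pass, as it would be in a real system that scores all drafted positions at once.

```python
from typing import List, Tuple


def target_next(ctx: List[int]) -> int:
    """Stand-in for the target model's greedy next token (deterministic toy)."""
    return (sum(ctx) * 31 + len(ctx)) % 50


def draft_next(ctx: List[int]) -> int:
    """Stand-in draft: agrees with the target except when len(ctx) % 4 == 0."""
    tok = target_next(ctx)
    return tok if len(ctx) % 4 else (tok + 1) % 50


def greedy_decode(prompt: List[int], n_new: int) -> List[int]:
    """Baseline: one target pass per generated token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(target_next(out))
    return out[len(prompt):]


def speculative_decode(prompt: List[int], n_new: int, k: int = 4) -> Tuple[List[int], int]:
    """Draft k tokens, keep the verified prefix, then add one target token."""
    out = list(prompt)
    target_passes = 0
    while len(out) - len(prompt) < n_new:
        # Draft phase: propose k tokens from the current context.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft_next(out + proposal))

        # Verify phase: counted as one target pass per round.
        target_passes += 1
        for tok in proposal:
            expected = target_next(out)
            if tok != expected:
                out.append(expected)  # first mismatch: use the target's token
                break
            out.append(tok)  # match: accept the drafted token
        else:
            out.append(target_next(out))  # all accepted: bonus target token
    return out[len(prompt):len(prompt) + n_new], target_passes


prompt = [1, 2, 3]
tokens, passes = speculative_decode(prompt, n_new=20)
print(tokens == greedy_decode(prompt, 20))  # output matches plain greedy decoding
print(passes)  # number of target verification rounds used
```

Every token appended to the output is either a drafted token that equals the target's choice or the target's own token, so the result is identical to plain greedy decoding. The saving comes from needing fewer target passes than generated tokens, which is why longer accepted prefixes translate directly into higher throughput.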

Good for

  • Accelerating inference throughput for large language models.
  • Optimizing speculative decoding performance with improved token acceptance rates.
  • Deploying flexible, efficient draft models in LLM serving stacks.