PARD-DeepSeek-R1-Distill-Qwen-1.5B: Accelerating LLM Inference
amd/PARD-DeepSeek-R1-Distill-Qwen-1.5B is a 1.5-billion-parameter model developed by AMD that serves as a parallel draft model within the PARD (PARallel Draft) framework, a high-performance speculative decoding method engineered to significantly accelerate LLM inference.
Key Capabilities & Differentiators
- Low-Cost Adaptation: PARD efficiently converts autoregressive (AR) draft models into parallel draft models with minimal overhead, achieving an average inference speedup of 1.78x over pure AR draft models. It also incorporates a conditional drop-token strategy for up to 3x improved training efficiency.
- Generalizability: Unlike target-dependent methods such as Medusa and EAGLE, PARD's design allows a single draft model to accelerate an entire family of target models. This reduces deployment complexity and adaptation costs by eliminating the need for retraining or tuning for each new target.
- High Performance: When integrated into optimized inference frameworks, PARD delivers substantial speedups. For instance, it achieves up to a 4.08x speedup with Transformers+, reaching 311.5 tokens per second with LLaMA3.1 8B. In vLLM, it provides up to a 3.06x speedup, outperforming other speculative decoding methods by 1.51x.
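To make the draft/target division of labor above concrete, here is a minimal, self-contained sketch of one greedy speculative decoding round. It is not PARD's implementation: PARD's parallel draft predicts several tokens in a single forward pass, whereas this toy drafts them one at a time, and `draft_fn`/`target_fn` are stand-in functions rather than real models.

```python
def speculative_step(prefix, draft_fn, target_fn, k=4):
    """One round of greedy speculative decoding (illustrative only).

    draft_fn(prefix)  -> next-token guess from the cheap draft model
    target_fn(prefix) -> next token from the expensive target model
    Returns the tokens emitted this round; the target runs once per round,
    so accepting several draft tokens is where the speedup comes from.
    """
    # Draft proposes k tokens (PARD would produce these in one parallel pass).
    proposal, ctx = [], list(prefix)
    for _ in range(k):
        t = draft_fn(ctx)
        proposal.append(t)
        ctx.append(t)

    # Target verifies: accept the longest agreeing prefix; on the first
    # mismatch, emit the target's own token instead and stop.
    accepted, ctx = [], list(prefix)
    for t in proposal:
        want = target_fn(ctx)
        if want != t:
            accepted.append(want)
            return accepted
        accepted.append(t)
        ctx.append(t)
    accepted.append(target_fn(ctx))  # all k matched: target adds a bonus token
    return accepted


# Toy models over digit "tokens": the target always emits (last + 1) % 10;
# the draft agrees except after a 3, where it wrongly guesses 9.
def target_fn(ctx):
    return (ctx[-1] + 1) % 10

def draft_fn(ctx):
    return 9 if ctx[-1] == 3 else (ctx[-1] + 1) % 10

print(speculative_step([1], draft_fn, target_fn, k=4))  # mismatch after two accepts
print(speculative_step([5], draft_fn, target_fn, k=4))  # all accepted + bonus token
```

Note that even on the mismatch path the round still emits one correct token, so the output always matches what the target model alone would have generated, only faster.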
Use Cases
This model is ideal for developers and researchers focused on:
- Accelerating LLM Inference: Significantly reducing the latency of large language models.
- Cost-Effective Deployment: Utilizing a single draft model across various target LLMs without extensive re-adaptation.
- Research in Speculative Decoding: Exploring advanced methods for improving LLM generation speed and efficiency.
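As a rough back-of-envelope for the latency reductions quoted above, the standard speculative decoding analysis (assuming each drafted token is accepted independently with probability alpha, which real, context-dependent acceptance only approximates) gives the expected number of tokens emitted per expensive target-model pass. The alpha and k values below are illustrative, not PARD's measured figures.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens emitted per target forward pass when a draft model
    proposes k tokens, each accepted i.i.d. with probability alpha:
    E = (1 - alpha^(k+1)) / (1 - alpha), a truncated geometric series."""
    if alpha == 1.0:
        return k + 1.0  # every draft token accepted, plus the bonus token
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


# e.g. an 80% acceptance rate with a 4-token draft amortizes each target
# pass over roughly 3.36 emitted tokens:
print(round(expected_tokens_per_pass(0.8, 4), 2))
```

This estimate ignores the draft model's own cost, which is why a cheap 1.5B draft paired with a much larger target family is the attractive operating point.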