zai-org/apar-13b

Text Generation · Concurrency Cost: 1 · Model Size: 13B · Quant: FP8 · Context Length: 4k · Published: Jul 22, 2024 · Architecture: Transformer

APAR-13B is a 13 billion parameter language model developed by zai-org, designed for efficient text generation. It introduces Auto-Parallel Auto-Regressive (APAR) decoding, which lets the model plan its own generation structure and decode independent spans of a response in parallel. This significantly reduces the number of sequential generation steps, improving both speed and throughput in high-throughput serving scenarios.


APAR-13B: Efficient Auto-Parallel Auto-Regressive Decoding

APAR-13B is a 13 billion parameter language model developed by zai-org, focused on highly efficient text generation. Its core innovation is the Auto-Parallel Auto-Regressive (APAR) decoding method, which allows the model to plan its own generation process: it is instruct-tuned on general-domain data containing hierarchical structures (such as lists), so it learns to identify independent spans of a response and decode them as parallel branches.

Key Capabilities & Performance:

  • Parallel Generation: Enables LLMs to generate text in parallel, significantly reducing the number of sequential decoding steps.
  • Speed-up: APAR alone can achieve up to a 2x speed-up in generation. When combined with speculative decoding, this can reach up to 4x.
  • Resource Optimization: Reduces key-value cache consumption and attention computation during generation.
  • Improved Throughput: Leads to a 20-70% increase in throughput in high-throughput scenarios compared to state-of-the-art serving frameworks.
  • Reduced Latency: Decreases latency by 20-35% in high-throughput environments.
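A back-of-the-envelope step count shows where the speed-up comes from. The numbers and the fixed `fork_overhead` below are made-up illustrations for this sketch, not figures from the APAR paper:

```python
def sequential_steps(num_branches, tokens_per_branch):
    # Vanilla auto-regressive decoding emits one token per forward pass,
    # so cost grows with the total token count.
    return num_branches * tokens_per_branch

def apar_steps(num_branches, tokens_per_branch, fork_overhead=1):
    # With APAR, independent branches share the same batched forward
    # passes, so wall-clock steps track the longest branch plus a small
    # forking overhead (a simplifying assumption in this sketch).
    return fork_overhead + tokens_per_branch

# Example: a 4-item list, 50 tokens per item.
print(sequential_steps(4, 50))  # 200 sequential steps
print(apar_steps(4, 50))        # 51 steps when branches decode in parallel
```

In practice the realized gain is smaller (the card cites up to 2x for APAR alone), since responses are only partially parallelizable and batching capacity is finite.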

Use Cases:

  • Efficient LLM Serving: Ideal for applications that require high-throughput, low-latency text generation.
  • Cost-Effective Deployment: Benefits scenarios where computational resources and inference speed are critical.

For more technical details, refer to the APAR paper and the GitHub repository.