ldsjmdy/Tulu3-Block-FT
ldsjmdy/Tulu3-Block-FT is an 8-billion-parameter language model based on Tulu 3 and fine-tuned with a novel Block-attention mechanism. It is designed to cut inference latency and computational cost in Retrieval-Augmented Generation (RAG) by processing retrieved documents as discrete blocks. The model matches full-attention baselines on RAG, in-context learning (ICL), and general benchmarks while drastically reducing time to first token (TTFT) and FLOPs, making it well suited to long-context applications.
Tulu3-Block-FT: Efficient Prefilling with Block-Attention
ldsjmdy/Tulu3-Block-FT is an 8-billion-parameter language model that implements the Block-attention mechanism introduced in the paper "Block-Attention for Efficient Prefilling". It is engineered to address the high inference latency and cost of Retrieval-Augmented Generation (RAG), particularly with long contexts of up to 32,768 tokens.
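If you just want to try the checkpoint, below is a minimal sketch using the standard transformers API, assuming the repository loads as an ordinary causal LM (i.e., full-attention mode). The prompt is a placeholder; the block segmentation and prompt format needed for Block-attention prefilling come from the authors' released code and are not reproduced here.

```python
# Hedged sketch: load ldsjmdy/Tulu3-Block-FT as a standard causal LM and generate.
# This exercises only the ordinary (full-attention) inference path.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ldsjmdy/Tulu3-Block-FT"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")

prompt = "Question: What does Block-attention change about RAG prefilling?\nAnswer:"  # placeholder prompt
inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```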
Key Capabilities & Differentiators
- Block-attention Mechanism: Unlike traditional approaches that encode the entire context auto-regressively, Block-attention divides retrieved documents into discrete blocks. Each block computes its key-value (KV) states independently, so the KV states of previously seen passages can be reused (see the sketch after this list).
- Significant Efficiency Gains: This approach drastically reduces the time to first token (TTFT) and floating-point operations (FLOPs) during inference. For a 32K input sequence, it achieves a TTFT of just 45 ms, representing a 98.7% reduction in TTFT and a 99.8% reduction in FLOPs compared to full-attention models.
- Performance Parity: Despite its efficiency, Tulu3-Block-FT maintains performance comparable to full-attention models across 11 diverse benchmarks, including RAG, In-Context Learning (ICL), and general domains. It can seamlessly switch between block and full attention modes without performance degradation.
- RAG Optimization: The model is fine-tuned to adapt to the Block-attention mechanism, making it highly effective for RAG tasks by enabling efficient processing of multiple retrieved documents.
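To make the mechanism above concrete, here is a self-contained toy sketch of the idea, not the model's actual implementation: toy dimensions, a single attention step, per-block self-attention collapsed to simple projections, and no positional re-encoding. Each retrieved block's KV states are computed in isolation and cached by content hash; the final query then attends over the concatenated KVs.

```python
# Toy illustration of block-wise KV computation and reuse (all names/dimensions are made up).
import hashlib
import torch
import torch.nn.functional as F

D = 64                                  # toy hidden size
Wq = torch.randn(D, D) / D ** 0.5       # toy query/key/value projections
Wk = torch.randn(D, D) / D ** 0.5
Wv = torch.randn(D, D) / D ** 0.5

kv_cache = {}                           # block hash -> (K, V)

def block_id(text):
    return hashlib.sha256(text.encode()).hexdigest()

def embed(text):
    # Deterministic stand-in for token embeddings: one vector per character.
    torch.manual_seed(int(block_id(text)[:8], 16))
    return torch.randn(len(text), D)

def encode_block(text):
    # Compute a block's KV states in isolation (no attention to other blocks)
    # and reuse the cached result when the same passage shows up again.
    key = block_id(text)
    if key not in kv_cache:
        h = embed(text)
        kv_cache[key] = (h @ Wk, h @ Wv)
    return kv_cache[key]

def answer(blocks, question):
    # Concatenate the per-block KVs; only the question tokens attend over all of them.
    ks, vs = zip(*(encode_block(b) for b in blocks))
    k, v = torch.cat(ks), torch.cat(vs)
    q = embed(question) @ Wq
    attn = F.softmax(q @ k.T / D ** 0.5, dim=-1)
    return attn @ v                     # contextualised question states

out = answer(["passage about cats", "passage about dogs"], "Which passage mentions cats?")
print(out.shape, "blocks cached:", len(kv_cache))
```

Because each block is encoded without looking at the others, a passage that reappears in a later request never needs to be prefilled through the model again, which is where the TTFT and FLOPs savings come from.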
Should you use this model?
- For RAG applications requiring low latency: If your use case involves Retrieval-Augmented Generation with long contexts and demands minimal time to first token, Tulu3-Block-FT offers substantial efficiency improvements.
- For scenarios with repetitive context: The ability to reuse KV states for repeated blocks makes it ideal for applications where parts of the context stay consistent across queries (see the snippet after this list).
- For Game AI: The paper highlights significant potential benefits for Game AI scenarios, suggesting its utility in interactive, real-time applications.
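As a concrete illustration of that reuse, continuing the toy sketch from the capabilities section (the passages are hypothetical): a block that recurs across queries is encoded once and served from the cache afterwards.

```python
# Both requests include "passage about cats"; its KV states are computed only once.
answer(["passage about cats", "passage about dogs"], "first question")
answer(["passage about cats", "passage about birds"], "second question")
print(len(kv_cache))  # 3 distinct blocks encoded, not 4
```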