ldsjmdy/Tulu3-RAG
Tulu3-RAG is an 8 billion parameter language model developed by ldsjmdy, specifically fine-tuned for Retrieval-Augmented Generation (RAG) scenarios. It implements a novel Block-attention mechanism that significantly reduces inference latency and computational cost in RAG by reusing the Key-Value states of previously seen document blocks. The model achieves accuracy comparable to full-attention models while drastically improving Time To First Token (TTFT) and FLOPs, making it highly efficient for RAG applications, with the developers also noting potential benefits for Game AI.
Tulu3-RAG: Efficient RAG with Block-Attention
ldsjmdy/Tulu3-RAG is an 8 billion parameter model designed to optimize Retrieval-Augmented Generation (RAG) workflows. It introduces Block-attention, an innovative attention mechanism that addresses the high inference latency and cost typically associated with RAG.
Key Capabilities & Features
- Block-attention Mechanism: Divides retrieved documents into discrete blocks, calculating Key-Value (KV) states independently for each block (except the final one). This allows for the reuse of KV states for passages seen before, significantly reducing computation.
- Reduced Latency and Cost: Achieves a remarkable 98.7% reduction in Time To First Token (TTFT) and a 99.8% reduction in FLOPs compared to full-attention models. It can output the first token for a 32K input sequence in just 45 ms.
- Performance Parity: After block fine-tuning, the model maintains performance comparable to full-attention models across 11 diverse benchmarks, including RAG, In-Context Learning (ICL), and general domains.
- Flexible Attention Modes: Can seamlessly switch between Block-attention and full-attention modes without performance degradation.
- High Context Length: Supports a context length of 32768 tokens.
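The block structure described above can be illustrated with an attention mask. In a minimal sketch (a simplification of the actual mechanism), each retrieved block attends causally only within itself, while the final block attends causally to the entire sequence:

```python
import numpy as np

def block_attention_mask(block_lengths):
    """Build a boolean Block-attention mask.

    Each block attends causally only within itself; the final block
    (typically the user query) attends causally to the whole sequence.
    block_lengths: lengths of consecutive blocks, e.g. [4, 4, 3].
    Returns an (N, N) mask where mask[i, j] is True if token i may
    attend to token j.
    """
    n = sum(block_lengths)
    mask = np.zeros((n, n), dtype=bool)
    start = 0
    for k, length in enumerate(block_lengths):
        end = start + length
        for i in range(start, end):
            if k == len(block_lengths) - 1:
                # Final block: full causal attention over the sequence.
                mask[i, : i + 1] = True
            else:
                # Earlier blocks: causal attention restricted to the block,
                # so their KV states are independent of other blocks and
                # can be cached and reused across queries.
                mask[i, start : i + 1] = True
        start = end
    return mask

mask = block_attention_mask([3, 3, 2])
```

Because the earlier blocks never attend outside themselves, their KV states depend only on the block's own content, which is what makes them reusable.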
Use Cases & Strengths
- Efficient RAG: Ideal for applications requiring fast and cost-effective retrieval-augmented generation, especially with long contexts.
- Game AI: The developers highlight substantial potential benefits for researchers in the gaming field due to its efficiency in handling contextual information.
- Instruction Following: The base Tulu3-SFT model, from which Tulu3-RAG is derived, is a strong instruction-tuned model.
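The KV-reuse idea behind the TTFT savings can be sketched with a toy cache keyed by block content. The function names and the stand-in "encode" step here are purely illustrative, not the model's actual API:

```python
import hashlib

kv_cache = {}  # block text -> precomputed KV states (stand-in values here)

def encode_block(text):
    """Stand-in for independently computing a block's Key-Value states."""
    return hashlib.sha256(text.encode()).hexdigest()  # placeholder for real KV tensors

def kv_states_for(blocks):
    """Return per-block KV states, reusing cached states for blocks seen before."""
    states, computed = [], 0
    for block in blocks:
        if block not in kv_cache:
            kv_cache[block] = encode_block(block)
            computed += 1
        states.append(kv_cache[block])
    return states, computed

# First query: all three blocks (two passages plus the query) must be encoded.
_, n1 = kv_states_for(["doc A", "doc B", "query 1"])  # n1 == 3
# Second query reuses doc A and doc B; only the new query block is encoded.
_, n2 = kv_states_for(["doc A", "doc B", "query 2"])  # n2 == 1
```

In the real model, only the final block is recomputed with attention over the full context, so repeated passages cost almost nothing at prefill time, which is where the reported TTFT and FLOPs reductions come from.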
Performance Highlights
On RAG benchmarks, Tulu3-RAG (and its block-fine-tuned variant) demonstrates strong accuracy:
- 2wiki: 73.2% (Tulu3-RAG), 73.6% (Tulu3-block-ft-full)
- HQA: 74.8% (Tulu3-RAG), 75.2% (Tulu3-block-ft-full)
- NQ: 61.5% (Tulu3-RAG), 62.2% (Tulu3-block-ft-full)
- TQA: 75.8% (Tulu3-RAG), 76.2% (Tulu3-block-ft-full)
For general benchmarks, Tulu3-block-ft shows competitive results, particularly in IFEval (70.0%) and DROP (14.4%).