ldsjmdy/Tulu3-RAG

Text generation · Concurrency cost: 1 · Model size: 8B · Quantization: FP8 · Context length: 32k · Published: Feb 28, 2025 · Architecture: Transformer

Tulu3-RAG is an 8-billion-parameter language model developed by ldsjmdy and fine-tuned specifically for Retrieval-Augmented Generation (RAG). It implements a novel Block-attention mechanism that cuts RAG inference latency and computational cost by reusing the Key-Value (KV) states of previously seen document blocks. The model achieves accuracy comparable to full-attention models while drastically improving Time To First Token (TTFT) and FLOPs, making it highly efficient for RAG applications and, per its developers, promising for Game AI.


Tulu3-RAG: Efficient RAG with Block-Attention

ldsjmdy/Tulu3-RAG is an 8 billion parameter model designed to optimize Retrieval-Augmented Generation (RAG) workflows. It introduces Block-attention, an innovative attention mechanism that addresses the high inference latency and cost typically associated with RAG.

Key Capabilities & Features

  • Block-attention Mechanism: Divides retrieved documents into discrete blocks and computes Key-Value (KV) states independently for each block except the final one, so the KV states of previously seen passages can be reused rather than recomputed (see the sketch after this list).
  • Reduced Latency and Cost: Achieves a remarkable 98.7% reduction in Time To First Token (TTFT) and a 99.8% reduction in FLOPs compared to full-attention models. It can output the first token for a 32K input sequence in just 45 ms.
  • Performance Parity: After block fine-tuning, the model maintains performance comparable to full-attention models across 11 diverse benchmarks, including RAG, In-Context Learning (ICL), and general domains.
  • Flexible Attention Modes: Can seamlessly switch between Block-attention and full-attention modes without performance degradation.
  • High Context Length: Supports a context length of 32768 tokens.
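
To make the per-block KV reuse concrete, here is a minimal PyTorch sketch of the idea. It is illustrative only, not the authors' implementation: a single linear projection stands in for a real transformer layer's per-block self-attention, the attention is single-head, positional encoding is omitted, and the hash-keyed cache is an assumed design.

```python
import hashlib
import torch
import torch.nn.functional as F

D = 64  # toy head dimension (assumption for illustration)
Wq, Wk, Wv = (torch.randn(D, D) * D**-0.5 for _ in range(3))

# Cache of per-block KV states, keyed by the block's content hash.
kv_cache: dict[str, tuple[torch.Tensor, torch.Tensor]] = {}

def block_kv(block: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    """KV states for one block, computed without attending to other blocks."""
    key = hashlib.sha256(block.numpy().tobytes()).hexdigest()
    if key not in kv_cache:            # recompute only for unseen passages
        kv_cache[key] = (block @ Wk, block @ Wv)
    return kv_cache[key]

def first_token_context(blocks: list[torch.Tensor],
                        query: torch.Tensor) -> torch.Tensor:
    """Only the final segment (the user query) attends across all blocks."""
    ks, vs = zip(*(block_kv(b) for b in blocks))
    k = torch.cat([*ks, query @ Wk])   # reused block KVs + fresh query KVs
    v = torch.cat([*vs, query @ Wv])
    q = query @ Wq
    attn = F.softmax(q @ k.T / D**0.5, dim=-1)
    return attn @ v                    # context feeding the first output token

docs = [torch.randn(128, D) for _ in range(4)]  # stand-ins for embedded passages
question = torch.randn(16, D)
_ = first_token_context(docs, question)   # populates the per-block cache
_ = first_token_context(docs, question)   # second call reuses all block KVs
```

Because each block's KV states never depend on the other blocks, a passage cached once can be reused across any request that retrieves it, which is where the TTFT and FLOPs savings come from.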

Use Cases & Strengths

  • Efficient RAG: Ideal for applications requiring fast, cost-effective retrieval-augmented generation, especially over long contexts (a loading sketch follows this list).
  • Game AI: The developers highlight substantial potential benefits for researchers in the gaming field due to its efficiency in handling contextual information.
  • Instruction Following: The base Tulu3-SFT model, from which Tulu3-RAG is derived, is a strong instruction-tuned model.
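
For quick experimentation, the checkpoint should load with the standard Hugging Face transformers causal-LM API; the sketch below assumes a conventional checkpoint layout, and the passage-plus-question prompt is an illustrative placeholder, not the authors' documented RAG template.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ldsjmdy/Tulu3-RAG"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Hypothetical prompt layout: retrieved passages followed by the question.
passages = ["<retrieved passage 1>", "<retrieved passage 2>"]
question = "What mechanism does Tulu3-RAG use to cut TTFT?"
prompt = "\n\n".join(passages) + f"\n\nQuestion: {question}\nAnswer:"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[-1]:],
                       skip_special_tokens=True))
```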

Performance Highlights

On RAG benchmarks, Tulu3-RAG (and its block-fine-tuned variant) demonstrates strong accuracy:

| Benchmark | Tulu3-RAG | Tulu3-block-ft-full |
|---|---|---|
| 2wiki (2WikiMultiHopQA) | 73.2% | 73.6% |
| HQA (HotpotQA) | 74.8% | 75.2% |
| NQ (Natural Questions) | 61.5% | 62.2% |
| TQA (TriviaQA) | 75.8% | 76.2% |

On general benchmarks, the block-fine-tuned variant (Tulu3-block-ft) remains competitive, scoring 70.0% on IFEval and 14.4% on DROP.