ldsjmdy/Tulu3-RAG
Tulu3-RAG is an 8 billion parameter language model developed by ldsjmdy, specifically fine-tuned for Retrieval-Augmented Generation (RAG) scenarios. It implements a novel Block-attention mechanism that significantly reduces inference latency and computational cost in RAG by reusing the Key-Value states of previously seen document blocks. The model achieves accuracy comparable to full-attention models while drastically improving Time To First Token (TTFT) and FLOPs, making it highly efficient for RAG applications, with the developers also noting potential benefits for Game AI.
Tulu3-RAG: Efficient RAG with Block-Attention
ldsjmdy/Tulu3-RAG is an 8 billion parameter model designed to optimize Retrieval-Augmented Generation (RAG) workflows. It introduces Block-attention, an innovative attention mechanism that addresses the high inference latency and cost typically associated with RAG.
Key Capabilities & Features
- Block-attention Mechanism: Divides retrieved documents into discrete blocks, calculating Key-Value (KV) states independently for each block (except the final one). This allows for the reuse of KV states for passages seen before, significantly reducing computation.
- Reduced Latency and Cost: Achieves a remarkable 98.7% reduction in Time To First Token (TTFT) and a 99.8% reduction in FLOPs compared to full-attention models. It can output the first token for a 32K input sequence in just 45 ms.
- Performance Parity: After block fine-tuning, the model maintains performance comparable to full-attention models across 11 diverse benchmarks, including RAG, In-Context Learning (ICL), and general domains.
- Flexible Attention Modes: Can seamlessly switch between Block-attention and full-attention modes without performance degradation.
- High Context Length: Supports a context length of 32768 tokens.
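The block structure described above can be illustrated with an attention mask. In a minimal sketch (a simplification of the actual mechanism), each retrieved block attends causally only within itself, while the final block attends causally to the entire sequence:

```python
import numpy as np

def block_attention_mask(block_lengths):
    """Build a boolean Block-attention mask.

    Each block attends causally only within itself; the final block
    (typically the user query) attends causally to the whole sequence.
    block_lengths: lengths of consecutive blocks, e.g. [4, 4, 3].
    Returns an (N, N) mask where mask[i, j] is True if token i may
    attend to token j.
    """
    n = sum(block_lengths)
    mask = np.zeros((n, n), dtype=bool)
    start = 0
    for k, length in enumerate(block_lengths):
        end = start + length
        for i in range(start, end):
            if k == len(block_lengths) - 1:
                # Final block: full causal attention over the sequence.
                mask[i, : i + 1] = True
            else:
                # Earlier blocks: causal attention restricted to the block,
                # so their KV states are independent of other blocks and
                # can be cached and reused across queries.
                mask[i, start : i + 1] = True
        start = end
    return mask

mask = block_attention_mask([3, 3, 2])
```

Because the earlier blocks never attend outside themselves, their KV states depend only on the block's own content, which is what makes them reusable.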
Use Cases & Strengths
- Efficient RAG: Ideal for applications requiring fast and cost-effective retrieval-augmented generation, especially with long contexts.
- Game AI: The developers highlight substantial potential benefits for researchers in the gaming field due to its efficiency in handling contextual information.
- Instruction Following: The base Tulu3-SFT model, from which Tulu3-RAG is derived, is a strong instruction-tuned model.
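The KV-reuse idea behind the TTFT savings can be sketched with a toy cache keyed by block content. The function names and the stand-in "encode" step here are purely illustrative, not the model's actual API:

```python
import hashlib

kv_cache = {}  # block text -> precomputed KV states (stand-in values here)

def encode_block(text):
    """Stand-in for independently computing a block's Key-Value states."""
    return hashlib.sha256(text.encode()).hexdigest()  # placeholder for real KV tensors

def kv_states_for(blocks):
    """Return per-block KV states, reusing cached states for blocks seen before."""
    states, computed = [], 0
    for block in blocks:
        if block not in kv_cache:
            kv_cache[block] = encode_block(block)
            computed += 1
        states.append(kv_cache[block])
    return states, computed

# First query: all three blocks (two passages plus the query) must be encoded.
_, n1 = kv_states_for(["doc A", "doc B", "query 1"])  # n1 == 3
# Second query reuses doc A and doc B; only the new query block is encoded.
_, n2 = kv_states_for(["doc A", "doc B", "query 2"])  # n2 == 1
```

In the real model, only the final block is recomputed with attention over the full context, so repeated passages cost almost nothing at prefill time, which is where the reported TTFT and FLOPs reductions come from.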
Performance Highlights
On RAG benchmarks, Tulu3-RAG (and its block-fine-tuned variant) demonstrates strong accuracy:
- 2wiki: 73.2% (Tulu3-RAG), 73.6% (Tulu3-block-ft-full)
- HQA: 74.8% (Tulu3-RAG), 75.2% (Tulu3-block-ft-full)
- NQ: 61.5% (Tulu3-RAG), 62.2% (Tulu3-block-ft-full)
- TQA: 75.8% (Tulu3-RAG), 76.2% (Tulu3-block-ft-full)
For general benchmarks, Tulu3-block-ft shows competitive results, particularly in IFEval (70.0%) and DROP (14.4%).