Overview

Gradient's Llama-3-8B-Instruct-Gradient-1048k is an 8 billion parameter instruction-tuned model based on Meta's Llama 3 8B. Its primary differentiator is a dramatically extended context window, increased from the base model's 8K tokens to over 1 million tokens (1048K). This was achieved by Gradient through NTK-aware interpolation and empirical RoPE theta optimization, followed by progressive training on increasing context lengths. The training involved 1.4 billion tokens across all stages, a minimal fraction of Llama 3's original pre-training data.

Key Capabilities

Massive Context Window: Supports context lengths up to 1048K tokens, enabling processing of extremely long documents and conversations.
Strong Retrieval & Q&A: Ranks highly in RULER evaluations for retrieval and question-answering tasks, performing comparably to much larger models like GPT-4 and Yi.
Efficient Training Approach: Demonstrates that state-of-the-art LLMs can achieve long-context capabilities with minimal additional training data by optimizing RoPE theta.
Enhanced Chat Ability: Further fine-tuned to strengthen its assistant-like conversational capabilities.

Good for

Applications requiring processing and understanding of very long documents or extensive conversational history.
Retrieval-augmented generation (RAG) systems where large amounts of information need to be queried.
Developing custom AI agents that benefit from deep contextual understanding.

Overview

Overview

Key Capabilities

Good for

Full Model Card (README)