Falcon-RW-7B: A Research Model for Web Data Influence

Falcon-RW-7B is a 7.5 billion parameter causal decoder-only language model developed by the Technology Innovation Institute (TII). It was trained on 350 billion tokens drawn exclusively from RefinedWeb, a web dataset built with stringent filtering and large-scale deduplication. The model is a research artifact for studying how training on web data alone affects properties of large language models such as fairness, safety, and capabilities.

Key Characteristics:

Training Data Focus: Trained entirely on RefinedWeb. According to TII, the model matches or outperforms comparable models trained on curated datasets.
Architecture: Adapted from the GPT-3 architecture, with ALiBi linear attention biases in place of learned positional embeddings and FlashAttention for memory-efficient attention computation.
Language: English-only model.
License: Available under the Apache 2.0 license.
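
To make the ALiBi item above concrete, here is a minimal sketch of how ALiBi biases attention scores: each head gets a fixed slope, and the attention logit for a query-key pair is penalized in proportion to their distance. The function names, the head count, and the sequence length are illustrative assumptions, not Falcon-RW-7B's actual configuration or code, and the slope formula is the geometric sequence described in the ALiBi paper for power-of-two head counts.

```python
import math
from typing import List


def alibi_slopes(num_heads: int) -> List[float]:
    """Per-head slopes: a geometric sequence starting at 2^(-8/num_heads).

    Hypothetical helper; only power-of-two head counts are handled here.
    """
    if num_heads < 1 or num_heads & (num_heads - 1):
        raise ValueError("this sketch only handles power-of-two head counts")
    start = 2.0 ** (-8.0 / num_heads)
    return [start ** (h + 1) for h in range(num_heads)]


def alibi_bias(num_heads: int, seq_len: int) -> List[List[List[float]]]:
    """Return bias[h][i][j], the value added to the logit of query i and key j.

    Keys in the past are penalized linearly with distance. Future keys get
    -inf, which folds the causal mask into the same table for convenience.
    """
    bias = []
    for slope in alibi_slopes(num_heads):
        head = [
            [-slope * (i - j) if j <= i else -math.inf for j in range(seq_len)]
            for i in range(seq_len)
        ]
        bias.append(head)
    return bias


if __name__ == "__main__":
    # Illustrative sizes only.
    for row in alibi_bias(num_heads=8, seq_len=4)[0]:
        print(row)
```

Because the penalty depends only on relative distance, no positional embedding has to be learned, which is the property the ALiBi authors use to argue for better behavior on sequences longer than those seen in training.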

Intended Use:

Direct Use: Primarily for research into the influence of web data on LLMs.
Out-of-Scope Use: Not recommended for production use without thorough risk assessment and mitigation. For general-purpose, state-of-the-art applications, TII recommends using Falcon-7B or Falcon-40B, which were trained on significantly larger and more diverse datasets.
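
For research use, a minimal sketch of loading the model for text generation with the Hugging Face transformers library could look like the following. The Hub identifier `tiiuae/falcon-rw-7b`, the helper names, the bfloat16 setting, and the sampling parameters are assumptions made for illustration, so check them against the full model card before relying on them.

```python
MODEL_ID = "tiiuae/falcon-rw-7b"  # assumed Hub identifier, not stated in this summary


def build_generator(model_id: str = MODEL_ID):
    """Return a text-generation pipeline for the given checkpoint (hypothetical helper)."""
    # Imported lazily so this module can be read and tested without the heavy dependencies.
    import torch
    from transformers import AutoTokenizer, pipeline

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    return pipeline(
        "text-generation",
        model=model_id,
        tokenizer=tokenizer,
        torch_dtype=torch.bfloat16,  # assumes hardware with bfloat16 support
        device_map="auto",  # requires the accelerate package
    )


def sample(generator, prompt: str, max_new_tokens: int = 50) -> str:
    """Generate one continuation of `prompt` and return the full text."""
    outputs = generator(
        prompt,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        top_k=10,
    )
    return outputs[0]["generated_text"]


if __name__ == "__main__":
    # Downloads the checkpoint on first use; prompt is in English to match the model.
    print(sample(build_generator(), "Web data is useful for training because"))
```

Since this is a raw pretrained model, its continuations are unfiltered, which is consistent with the recommendation to assess risks before any use beyond research.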

Limitations:

As an English-only model, it will not generalize appropriately to other languages.
Because it was trained on web data, it carries the stereotypes and biases commonly found in large-scale web corpora.

Users are encouraged to fine-tune the model for specific tasks and implement guardrails for any potential production deployment.
