m-a-p/OpenLLaMA-Reproduce-872.42B
OpenLLaMA 7Bv2 is a 7 billion parameter language model developed by m-a-p, designed to produce high-quality, contextually relevant text. It was trained on a diverse composite dataset including web-crawled data, scholarly articles, and literature, giving it broad domain coverage. The model is optimized for general-purpose text generation and understanding across a wide range of topics, with a 4096-token context length.
OpenLLaMA 7Bv2 Overview
OpenLLaMA 7Bv2 is a 7 billion parameter language model focused on generating high-quality, contextually relevant text. It stands out for its training on a diverse composite dataset that spans web-crawled data, scholarly articles, a wide array of literature, and question-answer pairs.
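As a minimal usage sketch, the checkpoint should load through the standard Hugging Face transformers causal-LM interface. The repository ID below is a hypothetical placeholder, not a path confirmed by this card; substitute the actual published checkpoint name.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "m-a-p/OpenLLaMA-Reproduce"  # hypothetical ID; replace with the real checkpoint path

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.float16,  # half precision keeps 7B weights within ~14 GB of GPU memory
    device_map="auto",          # place weights automatically across available devices
)
```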
Key Training Details
The model's training incorporated a rich composite dataset comprising the following sources (a mixing sketch follows the list):
- Falcon RefinedWeb dataset: For broad, filtered internet knowledge.
- StarCoder datasets: Contributing code and code-adjacent understanding.
- Wikipedia: Providing encyclopedic knowledge.
- arXiv: For scientific and academic comprehension.
- Extensive book collections: Covering multiple genres.
- RedPajama's Stack Exchange data: Enhancing question-answering capabilities.
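The exact sampling recipe is not published here, but a composite mixture of this kind is commonly assembled by interleaving streaming datasets. The sketch below uses the Hugging Face datasets library; the dataset IDs are the public versions of the sources named above, and the sampling probabilities are illustrative assumptions, not the card's actual recipe.

```python
from datasets import load_dataset, interleave_datasets

# Stream each source and normalize to a single "text" column so they can be interleaved.
falcon = (load_dataset("tiiuae/falcon-refinedweb", split="train", streaming=True)
          .select_columns(["content"]).rename_column("content", "text"))
code = (load_dataset("bigcode/starcoderdata", split="train", streaming=True)
        .select_columns(["content"]).rename_column("content", "text"))
wiki = (load_dataset("wikimedia/wikipedia", "20231101.en", split="train", streaming=True)
        .select_columns(["text"]))

mixture = interleave_datasets(
    [falcon, code, wiki],
    probabilities=[0.75, 0.15, 0.10],  # hypothetical sampling weights
    seed=42,
)
print(next(iter(mixture))["text"][:200])  # peek at one interleaved example
```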
The training procedure used a maximum learning rate of 3e-4, a minimum of 3e-5, and a substantial batch size of 4 million tokens. The learning-rate schedule closely mirrors Llama 2's: linear warmup followed by cosine decay to 10% of the peak rate, which is exactly the 3e-5 floor quoted above.
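A minimal sketch of such a warmup-plus-cosine schedule follows; the warmup length is an assumption (Llama 2 used 2000 warmup steps), as the card does not state it.

```python
import math

MAX_LR = 3e-4        # peak learning rate from the card
MIN_LR = 3e-5        # floor: 10% of peak, as in Llama 2
WARMUP_STEPS = 2000  # assumed; not stated on this card

def lr_at(step: int, total_steps: int) -> float:
    """Linear warmup, then cosine decay from MAX_LR down to MIN_LR."""
    if step < WARMUP_STEPS:
        return MAX_LR * (step + 1) / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, total_steps - WARMUP_STEPS)
    return MIN_LR + 0.5 * (MAX_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))
```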
Potential Use Cases
Given its diverse training data, OpenLLaMA 7Bv2 is well-suited for:
- General text generation: Creating coherent and contextually appropriate text for various applications.
- Content summarization: Condensing information from diverse sources.
- Question answering: Providing informed responses from its broad knowledge base (see the usage sketch after this list).
- Research assistance: Aiding in understanding academic and scientific texts.
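As a short usage sketch tying these use cases to the loading snippet above: OpenLLaMA 7Bv2 is a base model, so it completes text rather than following instructions, and prompts work best when phrased as continuations, for example a Q/A template for question answering.

```python
# Continuing from the loading snippet above (tokenizer and model already created).
prompt = "Q: Summarize the main idea of the theory of relativity.\nA:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```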