m-a-p/OpenLLaMA-Reproduce-1409.29B
m-a-p/OpenLLaMA-Reproduce-1409.29B is a 7 billion parameter language model in the OpenLLaMA family, designed to produce high-quality, contextually relevant text predictions. It was trained on a diverse composite dataset spanning web data, scholarly articles, and literature to ensure broad domain coverage, and is optimized for general-purpose text generation and understanding across a wide range of topics.
OpenLLaMA 7Bv2 Model Overview
This model, OpenLLaMA 7Bv2, is a 7 billion parameter language model developed to provide high-quality and contextually relevant text predictions. It distinguishes itself through its training on a highly diverse composite dataset, which includes a wide array of sources to ensure broad applicability and robust understanding across various domains.
Key Training Data Sources
The model's training leveraged a rich and varied dataset, contributing to its comprehensive knowledge base:
- Falcon RefinedWeb dataset: General web-crawled text.
- StarCoder datasets: Likely contributing to code-related understanding.
- Wikipedia: Providing encyclopedic knowledge.
- arXiv: Incorporating academic papers for scientific understanding.
- Extensive book collections: Covering multiple genres for broad literary context.
- Stack Exchange data: Curated by RedPajama, offering question-answer pairs and technical discussions.
Training Procedure Highlights
The training process for OpenLLaMA 7Bv2 was designed for efficiency and stable convergence:
- Learning Rate: A peak learning rate of 3e-4, decaying to a minimum of 3e-5.
- Batch Size: Employed a substantial batch size of 4 million tokens.
- Learning Rate Scheduler: The scheduling strategy closely mirrors that used in Llama2, ensuring stable and optimal convergence during training.
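A Llama2-style schedule typically means a linear warmup followed by cosine decay from the peak learning rate down to the floor. Here is a minimal sketch using the 3e-4 / 3e-5 values above; the `warmup_steps` and `max_steps` values are illustrative assumptions, not figures from this model card:

```python
import math

def cosine_lr(step, max_steps, warmup_steps=2000, max_lr=3e-4, min_lr=3e-5):
    """Llama2-style schedule: linear warmup, then cosine decay to min_lr.

    warmup_steps and max_steps are assumed values for illustration only.
    """
    if step < warmup_steps:
        # Linear warmup from 0 to max_lr.
        return max_lr * step / warmup_steps
    # Cosine decay from max_lr at the end of warmup to min_lr at max_steps.
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

# The learning rate peaks at 3e-4 right after warmup and ends at 3e-5.
print(cosine_lr(2000, 100_000))    # peak
print(cosine_lr(100_000, 100_000)) # floor
```

Note that the 3e-5 minimum is 10% of the 3e-4 peak, matching the decay-to-10%-of-peak convention used in the Llama2 training recipe.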