marin-community/marin-8b-base
Marin 8B Base is an 8 billion parameter autoregressive language model developed by the Marin team at Stanford CRFM, built on the Llama architecture with a 32768-token context length. Trained on 12.7 trillion tokens across diverse datasets including Nemotron-CC, Dolma, and specialized Markdownified datasets, it demonstrates strong performance across a range of benchmarks, often outperforming Llama 3.1 8B and other open-source models in the 7-8B class. This base model is designed for broad applications and is particularly strong at reasoning and general language understanding tasks.
# Marin 8B Base: An Open-Source Foundation Model
Marin 8B Base is an 8 billion parameter autoregressive language model developed by the Marin team at Stanford CRFM, part of a collaborative effort to create open-source foundation models. Built on the Llama architecture and utilizing a variant of the Llama 3 tokenizer, it offers a 32768-token context length.
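A minimal loading sketch with Hugging Face Transformers, assuming the `transformers` library and the `marin-community/marin-8b-base` model ID from this card; the prompt and generation settings are illustrative, not recommendations:

```python
# Sketch: load Marin 8B Base and generate a plain-text continuation.
# Assumes the `transformers` and `torch` packages are installed and that
# the model is available under the ID shown on this card.
MODEL_ID = "marin-community/marin-8b-base"


def generate(prompt: str, max_new_tokens: int = 64) -> str:
    """Return the model's continuation of `prompt` as plain text."""
    # Lazy import so this module can be read/tested without the heavy deps.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)


if __name__ == "__main__":
    # Base model: use raw continuation prompts, not chat-style formatting.
    print(generate("The theory of general relativity states that"))
```

Since this is a base model rather than an instruction-tuned one, prompts should be written as text to continue rather than as chat messages.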
## Key Capabilities & Training
Marin 8B Base was trained extensively on 12.7 trillion tokens across multiple pre-training phases (Kestrel, Ocelot, Jellyfish, Phoenix, Starling, Deeper Starling). Its diverse training data includes:
- General Web Data: Nemotron-CC, DCLM Baseline, Dolma (including MegaWika, peS2o).
- Code & Math: Starcoder Data, Proofpile 2, FineMath, Dolmino-Mix-1124 (including FLAN, CodeSearchNet, GSM8K, MetaMath, MathCoder2 Synthetic).
- Specialized Datasets: Marin Markdownified StackExchange, Wikipedia, Ar5iv, and Datashop Science QA.
## Performance Highlights
Benchmarked against models such as Llama 3.1 8B, OLMo 2 7B, and MAP NEO 7B using LM Eval Harness, Marin 8B Base (Deeper Starling checkpoint) achieves an average score of 66.6. It often surpasses these competitors on tasks such as ARC Challenge (63.1), BBH (50.6), HellaSwag (83.6/82.3), and MMLU (67.6/65.9), showing particular strength in reasoning and general-knowledge tasks.
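A sketch of how such numbers could be reproduced with EleutherAI's LM Eval Harness (`lm-eval` package), assuming its `simple_evaluate` entry point; the task names are illustrative and should be checked against the task list of the installed harness version:

```python
# Sketch: evaluate Marin 8B Base on a subset of the reported benchmarks
# with the LM Eval Harness. Assumes the `lm-eval` package is installed.
TASKS = ["arc_challenge", "hellaswag", "mmlu"]  # illustrative subset


def evaluate(model_id: str = "marin-community/marin-8b-base") -> dict:
    """Run the harness on `model_id` and return its results dictionary."""
    # Lazy import: lm-eval pulls in torch/transformers, so load on demand.
    import lm_eval

    return lm_eval.simple_evaluate(
        model="hf",
        model_args=f"pretrained={model_id}",
        tasks=TASKS,
    )


if __name__ == "__main__":
    results = evaluate()
    for task, metrics in results["results"].items():
        print(task, metrics)
```

Exact scores depend on harness version, number of shots, and prompt formatting, so small deviations from the figures above are expected.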
## Important Considerations
As a base model, Marin 8B has not undergone any safety tuning or evaluation. Users should exercise caution and verify responses, as it can generate harmful or inaccurate content. It is not intended for fully autonomous use.