marin-community/marin-8b-base
Marin 8B Base is an 8 billion parameter autoregressive language model developed by the Marin team at Stanford CRFM, built on the Llama architecture with a 32768-token context length. Trained on 12.7 trillion tokens across diverse datasets including Nemotron-CC, Dolma, and specialized Markdownified datasets, it demonstrates strong performance across a range of benchmarks, often outperforming Llama 3.1 8B and other open-source models in the 7-8B class. This base model is designed for broad applications and is particularly strong at reasoning and general language understanding tasks.
# Marin 8B Base: An Open-Source Foundation Model
Marin 8B Base is an 8 billion parameter autoregressive language model developed by the Marin team at Stanford CRFM, part of a collaborative effort to create open-source foundation models. Built on the Llama architecture and utilizing a variant of the Llama 3 tokenizer, it offers a 32768-token context length.
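A minimal loading sketch with Hugging Face Transformers, assuming the `transformers` library and the `marin-community/marin-8b-base` model ID from this card; the prompt and generation settings are illustrative, not recommendations:

```python
# Sketch: load Marin 8B Base and generate a plain-text continuation.
# Assumes the `transformers` and `torch` packages are installed and that
# the model is available under the ID shown on this card.
MODEL_ID = "marin-community/marin-8b-base"


def generate(prompt: str, max_new_tokens: int = 64) -> str:
    """Return the model's continuation of `prompt` as plain text."""
    # Lazy import so this module can be read/tested without the heavy deps.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)


if __name__ == "__main__":
    # Base model: use raw continuation prompts, not chat-style formatting.
    print(generate("The theory of general relativity states that"))
```

Since this is a base model rather than an instruction-tuned one, prompts should be written as text to continue rather than as chat messages.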
## Key Capabilities & Training
Marin 8B Base was trained extensively on 12.7 trillion tokens across multiple pre-training phases (Kestrel, Ocelot, Jellyfish, Phoenix, Starling, Deeper Starling). Its diverse training data includes:
- General Web Data: Nemotron-CC, DCLM Baseline, Dolma (including MegaWika, peS2o).
- Code & Math: Starcoder Data, Proofpile 2, FineMath, Dolmino-Mix-1124 (including FLAN, CodeSearchNet, GSM8K, MetaMath, MathCoder2 Synthetic).
- Specialized Datasets: Marin Markdownified StackExchange, Wikipedia, Ar5iv, and Datashop Science QA.
## Performance Highlights
Benchmarked against models such as Llama 3.1 8B, OLMo 2 7B, and MAP NEO 7B using LM Eval Harness, Marin 8B Base (Deeper Starling checkpoint) achieves an average score of 66.6. It often surpasses these competitors on tasks such as ARC Challenge (63.1), BBH (50.6), HellaSwag (83.6/82.3), and MMLU (67.6/65.9), showing particular strength in reasoning and general-knowledge tasks.
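A sketch of how such numbers could be reproduced with EleutherAI's LM Eval Harness (`lm-eval` package), assuming its `simple_evaluate` entry point; the task names are illustrative and should be checked against the task list of the installed harness version:

```python
# Sketch: evaluate Marin 8B Base on a subset of the reported benchmarks
# with the LM Eval Harness. Assumes the `lm-eval` package is installed.
TASKS = ["arc_challenge", "hellaswag", "mmlu"]  # illustrative subset


def evaluate(model_id: str = "marin-community/marin-8b-base") -> dict:
    """Run the harness on `model_id` and return its results dictionary."""
    # Lazy import: lm-eval pulls in torch/transformers, so load on demand.
    import lm_eval

    return lm_eval.simple_evaluate(
        model="hf",
        model_args=f"pretrained={model_id}",
        tasks=TASKS,
    )


if __name__ == "__main__":
    results = evaluate()
    for task, metrics in results["results"].items():
        print(task, metrics)
```

Exact scores depend on harness version, number of shots, and prompt formatting, so small deviations from the figures above are expected.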
## Important Considerations
As a base model, Marin 8B has not undergone any safety tuning or evaluation. Users should exercise caution and verify responses, as it can generate harmful or inaccurate content. It is not intended for fully autonomous use.