Marin 8B Instruct: An Overview
Marin 8B Instruct is an 8 billion parameter instruction-tuned model developed by the Marin team at Stanford CRFM, built on the Llama 3 architecture. It is an SFT-only model (no RLHF stage), fine-tuned on a broad mix of datasets to strengthen its instruction-following capabilities.
Key Capabilities & Training
- Architecture: Based on the Llama 3 8B architecture, ensuring compatibility with standard Hugging Face Transformers libraries.
- Tokenizer: Utilizes a variant of the Llama 3 tokenizer, stanford-crfm/marin-tokenizer, which includes a bundled chat template.
- Instruction Tuning: Trained on diverse SFT datasets such as AceCode-89K, Bespoke-Stratos-17k, dolphin-r1 (including reasoning subsets), natural_reasoning, OpenThoughts-114k-math, smoltalk, tulu-3-sft-mixture, and verifiable-math-problems.
- Pre-training: The base model underwent extensive pre-training across multiple phases (Kestrel, Ocelot, Jellyfish, Phoenix, Starling, Deeper Starling) on datasets like Nemotron-CC, DCLM Baseline, Starcoder Data, Proofpile 2, FineMath, Dolma, and custom Marin Markdownified datasets (StackExchange, Wikipedia, Ar5iv).
- Performance: Marin 8B Base demonstrates competitive performance against models like Llama 3.1 8B, OLMo 2 7B, and MAP NEO 7B on LM Eval Harness benchmarks, often achieving higher average scores and excelling in tasks like ARC Easy, ARC Challenge, BBH, and MMLU.
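Because the model follows the Llama 3 architecture and ships a tokenizer with a bundled chat template, it can be used through the standard Hugging Face Transformers API. The sketch below illustrates this; the model id `marin-community/marin-8b-instruct` is an assumption (check the Hugging Face Hub for the exact repository name), and the heavy imports are kept inside the function so the snippet can be read without downloading an 8B checkpoint.

```python
def build_chat(user_message: str) -> list[dict]:
    """Build a messages list in the role/content format expected by
    tokenizer.apply_chat_template."""
    return [{"role": "user", "content": user_message}]


def generate(prompt: str, max_new_tokens: int = 256) -> str:
    # Imports are local so importing this module stays lightweight.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "marin-community/marin-8b-instruct"  # assumed Hub id
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, device_map="auto"
    )

    # The bundled chat template turns the messages list into model input.
    input_ids = tokenizer.apply_chat_template(
        build_chat(prompt), add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output_ids = model.generate(input_ids, max_new_tokens=max_new_tokens)

    # Decode only the newly generated tokens, not the prompt.
    return tokenizer.decode(
        output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True
    )
```

The same pattern works for multi-turn conversations by appending assistant and user turns to the messages list before re-applying the chat template.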
Considerations for Use
- Safety: Marin 8B has not undergone specific safety tuning or evaluation. Users should exercise caution: the model can generate harmful or sensitive content, and its responses may require independent verification.
- Intended Use: This model is not intended for fully autonomous use and should be deployed with appropriate safeguards and human oversight.
For more detailed information on the pre-training process, refer to the technical retrospective.