sbintuitions/sarashina2-70b
Sarashina2-70B is a 70 billion parameter causal language model developed by SB Intuitions, built on the Llama2 architecture with rotary position embeddings (RoPE). Trained on 2.1 trillion tokens, including a significant portion of Japanese Common Crawl data and English SlimPajama, it features a 102400-token vocabulary. The model targets general language tasks, with a particular emphasis on Japanese, and supports an 8192-token context length.
Sarashina2-70B: A Llama2-based Japanese-centric LLM
Sarashina2-70B is a large language model developed by SB Intuitions, built on the Llama2 architecture. This 70 billion parameter model uses rotary position embeddings (RoPE) and a substantial 102400-token vocabulary.
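The RoPE scheme mentioned above can be sketched in a few lines. This is a generic, minimal illustration of rotary position embeddings as used in Llama-style models, not SB Intuitions' actual implementation; the vector length and base are illustrative.

```python
import math

def rope(x, pos, base=10000.0):
    """Apply rotary position embeddings to one head vector (toy sketch).

    x: list of floats with even length d. Each pair (x[2i], x[2i+1]) is
    rotated by the angle pos * base**(-2i/d), so attention scores between
    rotated queries and keys depend only on relative position.
    """
    d = len(x)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)  # frequency decreases with dimension
        c, s = math.cos(theta), math.sin(theta)
        out.extend([x[i] * c - x[i + 1] * s,
                    x[i] * s + x[i + 1] * c])
    return out
```

Because each pair is rotated by a position-dependent angle, the dot product of a rotated query and key at positions m and n depends only on m - n, which is what lets RoPE encode relative positions inside ordinary attention.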
Key Characteristics
- Architecture: Based on the robust Llama2 framework, providing a strong foundation for language understanding and generation.
- Training Data: Trained on an extensive 2.1 trillion tokens, including 1 trillion tokens of Japanese Common Crawl data cleaned with CCNet and HojiChar, alongside English documents from SlimPajama (excluding books3).
- Tokenization: Utilizes a sentencepiece tokenizer with a unigram language model and byte-fallback, designed to process raw Japanese text directly, without pre-tokenization.
- Scalability: Part of a family of models, with 7B, 13B, and 70B parameter versions, all sharing the same training token count and vocabulary size.
Considerations for Use
It is important to note that Sarashina2-70B has not been instruction-tuned. Consequently, it may produce irrelevant, inaccurate, or biased outputs. Developers are advised to fine-tune the model for specific applications, incorporating human preferences and safety considerations, before deployment.