sbintuitions/sarashina2-70b
Sarashina2-70B is a 70 billion parameter causal language model developed by SB Intuitions, built on the Llama2 architecture with rotary position embeddings (RoPE). Trained on 2.1 trillion tokens, including a significant portion of Japanese Common Crawl data and English SlimPajama, it features a 102400-token vocabulary. The model targets general language tasks, with a particular emphasis on Japanese, and supports an 8192-token context length.
Sarashina2-70B: A Llama2-based Japanese-centric LLM
Sarashina2-70B is a large language model developed by SB Intuitions, built on the Llama2 architecture. This 70 billion parameter model uses rotary position embeddings (RoPE) and a substantial 102400-token vocabulary.
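The RoPE scheme mentioned above can be sketched in a few lines. This is a generic, minimal illustration of rotary position embeddings as used in Llama-style models, not SB Intuitions' actual implementation; the vector length and base are illustrative.

```python
import math

def rope(x, pos, base=10000.0):
    """Apply rotary position embeddings to one head vector (toy sketch).

    x: list of floats with even length d. Each pair (x[2i], x[2i+1]) is
    rotated by the angle pos * base**(-2i/d), so attention scores between
    rotated queries and keys depend only on relative position.
    """
    d = len(x)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)  # frequency decreases with dimension
        c, s = math.cos(theta), math.sin(theta)
        out.extend([x[i] * c - x[i + 1] * s,
                    x[i] * s + x[i + 1] * c])
    return out
```

Because each pair is rotated by a position-dependent angle, the dot product of a rotated query and key at positions m and n depends only on m - n, which is what lets RoPE encode relative positions inside ordinary attention.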
Key Characteristics
- Architecture: Based on the robust Llama2 framework, providing a strong foundation for language understanding and generation.
- Training Data: Trained on an extensive 2.1 trillion tokens, including 1 trillion tokens of Japanese Common Crawl data cleaned with CCNet and HojiChar, alongside English documents from SlimPajama (excluding books3).
- Tokenization: Utilizes a sentencepiece tokenizer with a unigram language model and byte-fallback, designed to process raw Japanese text directly, without pre-tokenization.
- Scalability: Part of a family of models, with 7B, 13B, and 70B parameter versions, all sharing the same training token count and vocabulary size.
Considerations for Use
It is important to note that Sarashina2-70B has not been instruction-tuned. Consequently, it may produce irrelevant, inaccurate, or biased outputs. Developers are advised to fine-tune the model for specific applications, incorporating human preferences and safety considerations, before deployment.