tokyotech-llm/Swallow-70b-NVE-hf
The Swallow-70b-NVE-hf model by tokyotech-llm is a 69 billion parameter language model continually pre-trained from the Llama 2 family, with a strong focus on Japanese language data. This specific variant, NVE (No Vocabulary Expansion), utilizes a standard Llama 2 tokenizer without Japanese vocabulary additions. It excels in Japanese language tasks, demonstrating significant performance improvements over its Llama 2 base, while maintaining competitive English language capabilities.
Swallow-70b-NVE-hf: Japanese-Enhanced Llama 2 Variant
Swallow-70b-NVE-hf is a 69 billion parameter large language model developed by tokyotech-llm, built upon the Llama 2 architecture. It has undergone extensive continual pre-training, primarily incorporating a substantial amount of Japanese language data, alongside datasets like RefinedWeb, Swallow Corpus, and The Pile. This model is part of the "No Vocabulary Expansion" (NVE) series, meaning it uses the original Llama 2 tokenizer without additional Japanese vocabulary, which can be a consideration for tokenization efficiency compared to models with expanded vocabularies.
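Because the checkpoint is published in the standard Hugging Face format, it can be loaded like any other Llama-2-based causal language model with the transformers library. The snippet below is a minimal sketch, not an official recipe: the dtype, device mapping, and sampling settings are illustrative assumptions for running a 70B model across multiple GPUs.

```python
# Minimal loading/generation sketch for Swallow-70b-NVE-hf (settings are assumptions).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tokyotech-llm/Swallow-70b-NVE-hf"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # assumption: bf16 weights to reduce memory footprint
    device_map="auto",           # shard the 70B parameters across available GPUs
)

# Japanese completion prompt; the model is a base (non-instruct) LM, so plain text works best.
prompt = "東京工業大学の主なキャンパスは、"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```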
Key Capabilities & Performance
The model demonstrates strong performance in Japanese language tasks, significantly outperforming the base Llama 2 70B model across various benchmarks such as JCommonsenseQA, JEMHopQA, NIILC, JSQuAD, XL-Sum, MGSM, and WMT20 machine translation. For instance, it achieves 0.9410 on JCommonsenseQA and 0.7024 on NIILC, surpassing Llama 2's scores. While optimized for Japanese, it maintains competitive, though slightly lower, performance on English benchmarks compared to the Llama 2 70B model.
Use Cases & Considerations
This model is particularly well-suited for applications requiring high proficiency in Japanese language understanding and generation. Its continual pre-training on Japanese data makes it a strong candidate for tasks such as Japanese question answering, summarization, and translation. Developers should note the NVE characteristic: because the model keeps the original Llama 2 tokenizer, Japanese text generally splits into more tokens than it would under a vocabulary-expanded variant, which affects effective context length and inference cost in Japanese-heavy workflows. The model is still at an early stage of research and development and has not been extensively tuned for safety or human alignment.
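One rough way to gauge the tokenization-efficiency trade-off mentioned above is to count tokens for the same Japanese text under the NVE tokenizer and under a vocabulary-expanded sibling. The sketch below assumes tokyotech-llm/Swallow-70b-hf as the expanded comparison point; adjust the IDs to whichever variants you are actually evaluating.

```python
# Compare Japanese token counts: NVE (original Llama 2 tokenizer) vs. a
# vocabulary-expanded Swallow tokenizer. Model IDs are assumptions based on
# the tokyotech-llm model family naming.
from transformers import AutoTokenizer

nve_tok = AutoTokenizer.from_pretrained("tokyotech-llm/Swallow-70b-NVE-hf")
exp_tok = AutoTokenizer.from_pretrained("tokyotech-llm/Swallow-70b-hf")

text = "吾輩は猫である。名前はまだ無い。"
print("NVE tokens:     ", len(nve_tok(text)["input_ids"]))
print("Expanded tokens:", len(exp_tok(text)["input_ids"]))
```

A higher NVE token count for the same text means fewer Japanese characters fit in a given context window and more tokens are generated per sentence, which is the main practical consequence of skipping vocabulary expansion.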