rinna/youri-7b: Japanese-Optimized Llama-2 Continual Pre-training
rinna/youri-7b is a 7-billion-parameter language model developed by rinna. It continues pre-training from Llama-2-7b on approximately 40 billion tokens drawn from a mixture of Japanese and English corpora, including Japanese CC-100, Japanese C4, Japanese OSCAR, The Pile, Wikipedia, and a Japanese dataset curated by rinna. This continual pre-training substantially improves performance on Japanese language tasks.
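A minimal inference sketch using the standard Hugging Face transformers API. The model ID comes from the model card above; the dtype, device placement, and sampling parameters are illustrative choices, not values recommended by rinna.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the tokenizer and model weights from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("rinna/youri-7b")
model = AutoModelForCausalLM.from_pretrained(
    "rinna/youri-7b",
    torch_dtype=torch.float16,  # half precision so a 7B model fits on one GPU
    device_map="auto",
)

# A Japanese continuation prompt: "Artificial intelligence is ..."
prompt = "人工知能とは、"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=100,
        do_sample=True,
        temperature=0.8,  # illustrative sampling settings
        top_p=0.95,
        pad_token_id=tokenizer.eos_token_id,  # Llama-2 defines no pad token
    )

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```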
Key Capabilities & Features
- Japanese Language Proficiency: Substantially improved capabilities for Japanese text generation and understanding due to specialized continual pre-training.
- Llama-2 Foundation: Inherits the 32-layer, 4096-hidden-size transformer architecture of Llama-2-7b.
- Standard Tokenization: Uses the original Llama-2 tokenizer unchanged (see the tokenizer sketch after this list).
- Benchmarking: Performance metrics are available on rinna's LM benchmark page and the Open LLM Leaderboard.
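Because the vocabulary is unchanged from Llama-2's 32k-piece SentencePiece model, Japanese text often splits into more tokens than it would under a Japanese-specific tokenizer. A quick way to inspect this with the transformers API (the sample sentence is arbitrary):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("rinna/youri-7b")

# Characters missing from the 32k vocabulary fall back to byte-level pieces,
# so the token count per character can be noticeably higher for Japanese.
text = "吾輩は猫である。"
tokens = tokenizer.tokenize(text)
print(tokens)
print(f"{len(tokens)} tokens for {len(text)} characters")
```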
Ideal Use Cases
- Japanese NLP Applications: Recommended for tasks requiring strong Japanese language generation, comprehension, and translation.
- Research & Development: Suitable for researchers and developers exploring multilingual LLMs, particularly those focusing on Japanese language models.