tokyotech-llm/Llama-3.3-Swallow-70B-v0.4
Task: Text generation
- Concurrency cost: 4
- Model size: 70B
- Quantization: FP8
- Context length: 32k
- Published: Feb 17, 2025
- License: llama3.3
- Architecture: Transformer
Llama-3.3-Swallow-70B-v0.4 is a 70-billion-parameter large language model developed by tokyotech-llm, built by continual pre-training from Meta Llama 3.3. It significantly enhances Japanese language capabilities while maintaining strong English performance, trained on approximately 315 billion tokens drawn from Japanese web corpora, Wikipedia, and mathematical and coding content. The model particularly excels at Japanese language tasks.
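A minimal sketch of querying this model through an OpenAI-compatible chat-completions endpoint, a common way hosted models like this are served. This only constructs the request payload; the endpoint URL, system prompt, and sampling parameters are illustrative assumptions, not values specified by this card.

```python
import json

# Model ID as listed on this card
MODEL_ID = "tokyotech-llm/Llama-3.3-Swallow-70B-v0.4"

def build_chat_request(user_message: str, max_tokens: int = 512) -> dict:
    """Build a payload for an OpenAI-compatible /v1/chat/completions endpoint.

    The system prompt and temperature below are hypothetical defaults.
    """
    return {
        "model": MODEL_ID,
        "messages": [
            # Bilingual system prompt: the model targets Japanese and English
            {"role": "system", "content": "You are a helpful bilingual assistant."},
            {"role": "user", "content": user_message},
        ],
        # Prompt plus completion must fit within the 32k-token context window
        "max_tokens": max_tokens,
        "temperature": 0.7,
    }

# Example: a Japanese-language request ("What is the capital of Japan?")
payload = build_chat_request("日本の首都はどこですか？")
print(json.dumps(payload, ensure_ascii=False, indent=2))
```

The payload can then be POSTed with any HTTP client to the provider's chat-completions URL, with an API key in the `Authorization` header.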