Shisa V2 Llama 3.1 8B Overview
Shisa.AI's shisa-v2-llama3.1-8b is an 8 billion parameter model from the Shisa V2 family, designed for robust bilingual performance in Japanese and English. This model leverages the Llama 3.1 architecture and features a 32,768 token context window. Unlike previous iterations that focused on tokenizer extension or continued pre-training, Shisa V2 models, including this 8B variant, are optimized through extensive post-training with refined synthetic data approaches.
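Because the model keeps the Llama 3.1 base's chat format, a prompt can be assembled by hand as in the sketch below. This is illustrative only; the helper name is hypothetical, and in practice the tokenizer's built-in `apply_chat_template` should be preferred.

```python
# Illustrative sketch of the Llama 3.1 chat format the model inherits.
# The helper name is hypothetical; in practice, prefer the tokenizer's
# built-in apply_chat_template over hand-rolled formatting.
def format_llama31_prompt(messages: list[dict]) -> str:
    """Render a message list into Llama 3.1's special-token format,
    ending with an open assistant header so the model continues."""
    parts = ["<|begin_of_text|>"]
    for msg in messages:
        parts.append(
            f"<|start_header_id|>{msg['role']}<|end_header_id|>\n\n"
            f"{msg['content']}<|eot_id|>"
        )
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)

prompt = format_llama31_prompt([
    {"role": "system", "content": "You are a helpful bilingual assistant."},
    {"role": "user", "content": "こんにちは！"},
])
```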
Key Capabilities & Performance
- Bilingual Proficiency: Achieves strong performance in both Japanese (JA Avg: 70.83) and English (EN Avg: 54.75) tasks.
- Enhanced Japanese Output: Demonstrates significant improvements in Japanese output quality compared to its base model, Llama-3.1-8B-Instruct, across various benchmarks like JA MT Bench, Rakuda, and Tengu.
- Instruction Following: Shows strong instruction-following abilities, particularly in Japanese contexts, as measured by the custom shisa-jp-ifeval and shisa-jp-rp-bench benchmarks.
- Role-Playing & Translation: Excels in Japanese role-play scenarios and Japanese-English translation tasks, with specific DPO datasets targeting these areas.
Training & Datasets
The model was trained using a diverse set of approximately 360K supervised fine-tuning (SFT) samples and 113K DPO samples. Key datasets include a refined shisa-v2-sharegpt for JA/EN, translated prompts from Rewild, and specialized synthetic datasets for Japanese role-playing, instruction following, and translation. The DPO stage notably utilized an English-only deepseekv3-ultrafeedback-armorm-dpo set, which surprisingly outperformed bilingual DPO sets for alignment.
Usage Considerations
- Chat Templates: Inherits chat templates from its Llama 3.1 base model.
- Inference: Validated for use with vLLM and SGLang.
- Temperature Settings: Use lower temperatures (e.g., 0.2) for translation accuracy and higher temperatures (e.g., 1.0) for creative or role-play tasks.
- Safety: No additional safety alignment beyond the base model; inherits base model biases and safety profiles.
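Putting the inference and temperature guidance together, a minimal vLLM invocation might look like the sketch below. The hub id `shisa-ai/shisa-v2-llama3.1-8b`, the helper name, the task labels, and the `max_tokens` value are assumptions; the temperature values follow the recommendations above.

```python
# Decoding kwargs per task, following the guidance above: 0.2 for
# translation accuracy, 1.0 for creative or role-play generation.
# (Helper name, task labels, and max_tokens are illustrative choices.)
def sampling_kwargs(task: str) -> dict:
    temperature = 0.2 if task == "translation" else 1.0
    return {"temperature": temperature, "max_tokens": 1024}

# Sketch of a vLLM call (requires a GPU and the model weights, so it
# is shown commented out; hub id "shisa-ai/shisa-v2-llama3.1-8b" is
# assumed, not confirmed by this card):
# from vllm import LLM, SamplingParams
# llm = LLM(model="shisa-ai/shisa-v2-llama3.1-8b")
# out = llm.chat(
#     [{"role": "user", "content": "Translate to English: 吾輩は猫である。"}],
#     SamplingParams(**sampling_kwargs("translation")),
# )
# print(out[0].outputs[0].text)
```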