Vikhr-Nemo-12B-Instruct-R-21-09-24: Enhanced Bilingual LLM
Vikhr-Nemo-12B-Instruct-R-21-09-24 is a 12-billion-parameter large language model developed by VikhrModels, built on Mistral-Nemo-Instruct-2407. It is specifically adapted and optimized for high-quality generation in Russian and English, with support for other languages, and it inherits the base model's context length of up to 128k tokens.
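A minimal loading and chat sketch using Hugging Face transformers is shown below; the repository id `Vikhrmodels/Vikhr-Nemo-12B-Instruct-R-21-09-24` and the example prompt are assumptions, not taken from this card:

```python
# Minimal chat sketch (assumed repo id; verify on the Hugging Face hub).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Vikhrmodels/Vikhr-Nemo-12B-Instruct-R-21-09-24"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [
    {"role": "system", "content": "Answer concisely."},   # English system prompt
    {"role": "user", "content": "Почему небо голубое?"},  # Russian query: "Why is the sky blue?"
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(inputs, max_new_tokens=256, do_sample=True,
                        temperature=0.3, top_k=40)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```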
Key Capabilities & Features
- Bilingual Proficiency: High-quality generation in Russian and English, supported by the custom Grandmaster-PRO-MAX dataset.
- Optimized for Diverse Tasks: Excels in reasoning, summarization, code generation, roleplay, and dialogue.
- Advanced RAG Mode: Features a dedicated "Grounded RAG" mode, inspired by Command-R, in which the model first identifies the identifiers of relevant documents and then uses them to produce grounded answers. This mode requires a specific `GROUNDED_SYSTEM_PROMPT` and accepts document content in Markdown, HTML, or Plain Text (see the sketch after this list).
- System Prompt Support: Allows regulating response style; system prompts are ideally written in English.
- Training Methodology: Developed through a multi-stage process: SFT on a synthetic dataset of 150k instructions, followed by alignment via SMPO, a custom DPO variant, to improve answer quality.
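Below is a hedged sketch of how the Grounded RAG flow could be driven: documents are passed as a JSON list with `doc_id` fields in a dedicated chat turn, Command-R style. The exact `GROUNDED_SYSTEM_PROMPT` text and the `documents` role name are assumptions here and should be copied from the model card itself:

```python
# Hypothetical Grounded RAG message layout; the "documents" role and the
# prompt text are assumptions -- take the real ones from the model card.
import json

GROUNDED_SYSTEM_PROMPT = "..."  # placeholder: copy the exact prompt from the card

documents = [
    # Document content may be Markdown, HTML, or Plain Text.
    {"doc_id": 0, "title": "Sky color", "content": "# Rayleigh scattering\n..."},
    {"doc_id": 1, "title": "Unrelated note", "content": "Plain text about tides."},
]

messages = [
    {"role": "system", "content": GROUNDED_SYSTEM_PROMPT},
    {"role": "documents", "content": json.dumps(documents, ensure_ascii=False)},
    {"role": "user", "content": "Why is the sky blue?"},
]
# Expected flow: the first generation turn returns the ids of the relevant
# documents (e.g. something like {"relevant_doc_ids": [0]}); appending that
# reply to the conversation and generating again yields the grounded answer.
```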
Performance Highlights
On the ru-arena-general benchmark, Vikhr-Nemo-12B-Instruct-R-21-09-24 achieved a win rate of 79.8% against gpt-3.5-turbo-0125 (whose 50% win rate serves as the reference point). In RAG benchmarks it also performed strongly, scoring 68% judge-correct-percent on in-domain questions and 92% on out-of-domain questions, outperforming gpt-4o-mini and gpt-3.5-turbo-0125 on some metrics.
Limitations & Recommendations
- Safety: The model has a low level of safety by default, prioritizing instruction following. Users should implement their own safety measures.
- System Prompts: Best used for style specification (e.g., "answer only in json format") and preferably in English.
- Temperature: Recommended to run with a low temperature (0.1-0.5) and `top_k` (30-50) to avoid generation defects (see the sampling sketch below).
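The sketch below combines the last two recommendations: an English system prompt that pins the output format, plus conservative sampling via transformers' standard `GenerationConfig`. The concrete values are just one point inside the recommended ranges:

```python
# Conservative sampling per the recommendations above (values are examples
# within the stated ranges, not mandated settings).
from transformers import GenerationConfig

messages = [
    {"role": "system", "content": "Answer only in JSON format."},  # style control
    {"role": "user", "content": "List three facts about the Mistral-Nemo base model."},
]

gen_config = GenerationConfig(
    do_sample=True,
    temperature=0.3,  # recommended range: 0.1-0.5
    top_k=40,         # recommended range: 30-50
    max_new_tokens=512,
)
# Usage: model.generate(inputs, generation_config=gen_config) with the
# tokenized chat template from the loading sketch above.
```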