QVikhr-2.5-1.5B-Instruct-SMPO Overview
QVikhr-2.5-1.5B-Instruct-SMPO is a 1.5-billion-parameter instruction-following language model from Vikhrmodels, built on Qwen-2.5-1.5B-Instruct. Its main distinction is its alignment via Simple Margin Preference Optimization (SMPO), a method designed to improve the stability and control of preference training, particularly in combination with rejection sampling.
Key Capabilities & Training:
- Bilingual Support: Optimized for Russian (RU) language tasks, while also supporting English (EN).
- Advanced Alignment: Fine-tuned with SMPO, a preference-optimization technique developed by Vikhrmodels to improve response quality (see the loss sketch below this list).
- Training Data: Aligned on a high-quality, deduplicated subset of the GrandMaster-PRO-MAX Russian dataset (approximately 10k dialogues).
- Reward Model: Skywork/Skywork-Reward-Llama-3.1-8B-v0.2 scored candidate responses during alignment.
- Rejection Sampling: Preference pairs were built by rejection sampling over 7 hypotheses generated from the Vikhr-Qwen-2.5-1.5B-Instruct SFT checkpoint (see the pair-construction sketch just below).
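
To make the data-creation step concrete, below is a minimal sketch of how such preference pairs can be built from the two models named above. The repository paths are assumptions inferred from the names in this card, and the sampling settings (temperature, response length) are illustrative rather than the exact Vikhrmodels configuration.

```python
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoModelForSequenceClassification,
    AutoTokenizer,
)

# Repo paths assumed from the model names in this card.
SFT_MODEL = "Vikhrmodels/Vikhr-Qwen-2.5-1.5B-Instruct"
REWARD_MODEL = "Skywork/Skywork-Reward-Llama-3.1-8B-v0.2"

gen_tok = AutoTokenizer.from_pretrained(SFT_MODEL)
gen_model = AutoModelForCausalLM.from_pretrained(
    SFT_MODEL, torch_dtype=torch.bfloat16, device_map="auto"
)
rm_tok = AutoTokenizer.from_pretrained(REWARD_MODEL)
rm = AutoModelForSequenceClassification.from_pretrained(
    REWARD_MODEL, torch_dtype=torch.bfloat16, device_map="auto", num_labels=1
)

def build_preference_pair(prompt: str, n_hypotheses: int = 7) -> dict:
    """Sample n hypotheses from the SFT model, score each with the reward
    model, and keep the best/worst as a (chosen, rejected) pair."""
    messages = [{"role": "user", "content": prompt}]
    inputs = gen_tok.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(gen_model.device)
    outputs = gen_model.generate(
        inputs,
        do_sample=True,
        temperature=0.8,          # illustrative sampling settings
        max_new_tokens=512,
        num_return_sequences=n_hypotheses,
    )
    completions = [
        gen_tok.decode(o[inputs.shape[-1]:], skip_special_tokens=True)
        for o in outputs
    ]

    scores = []
    for completion in completions:
        convo = [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": completion},
        ]
        # The Skywork reward model emits a single scalar logit for a
        # chat-templated conversation.
        text = rm_tok.apply_chat_template(convo, tokenize=False)
        rm_inputs = rm_tok(text, return_tensors="pt").to(rm.device)
        with torch.no_grad():
            scores.append(rm(**rm_inputs).logits[0][0].item())

    ranked = sorted(zip(scores, completions), key=lambda x: x[0])
    return {"prompt": prompt, "chosen": ranked[-1][1], "rejected": ranked[0][1]}
```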
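
The exact SMPO objective lives in Vikhrmodels' own implementation; as a rough illustration of the margin idea it is named after, the sketch below implements a generic hinge-margin preference loss over length-normalized sequence log-probabilities. The function name, the normalization, and the default margin value are assumptions for illustration, not the published SMPO formulation.

```python
import torch
import torch.nn.functional as F

def margin_preference_loss(
    logp_chosen: torch.Tensor,    # summed token log-probs of chosen responses, shape (batch,)
    logp_rejected: torch.Tensor,  # summed token log-probs of rejected responses, shape (batch,)
    len_chosen: torch.Tensor,     # token counts of chosen responses
    len_rejected: torch.Tensor,   # token counts of rejected responses
    margin: float = 1.0,          # illustrative default, not a published SMPO value
) -> torch.Tensor:
    # Length-normalize so longer responses are not preferred merely for length.
    reward_chosen = logp_chosen / len_chosen
    reward_rejected = logp_rejected / len_rejected
    # Hinge: the loss is zero once the chosen response beats the rejected one
    # by at least `margin`, so well-separated pairs stop contributing gradient.
    return F.relu(margin - (reward_chosen - reward_rejected)).mean()

# Toy usage with dummy log-probabilities and lengths:
loss = margin_preference_loss(
    logp_chosen=torch.tensor([-40.0, -55.0]),
    logp_rejected=torch.tensor([-60.0, -58.0]),
    len_chosen=torch.tensor([20.0, 25.0]),
    len_rejected=torch.tensor([22.0, 24.0]),
)
print(loss)
```

Capping the loss at the margin bounds the update pressure any single pair can exert, which is one common route to the training stability that margin-based preference methods aim for.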
Good For:
- Applications requiring a compact (1.5B) yet capable model for Russian language generation and instruction following (a minimal usage sketch follows this list).
- Use cases where response quality and alignment stability are critical and SMPO's margin-based training is a good fit.
- Developers interested in exploring models fine-tuned with advanced preference optimization techniques for bilingual (RU/EN) contexts.
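
For quick experimentation, the model can be loaded with the standard transformers chat workflow, as in the minimal sketch below. The repository id is assumed from the model name in this card, and the generation parameters are illustrative.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "Vikhrmodels/QVikhr-2.5-1.5B-Instruct-SMPO"  # assumed HF repo path

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

# Russian instruction-following example; the model also supports English.
# ("Explain in simple terms what preference optimization is.")
messages = [{"role": "user", "content": "Объясни простыми словами, что такое преференс-оптимизация."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```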