Virtuoso-Small-v2: Deepseek-v3 Distillation
Overview
Virtuoso-Small-v2 is a 14.8-billion-parameter language model from arcee-ai, built on the Qwen-2.5-14B architecture. It was distilled from Deepseek-v3 using a dataset of over 5 billion tokens of teacher logits, with "tokenizer surgery" to bridge the two architectures' vocabularies and a proprietary "fusion merging" step. The aim is direct, distribution-level knowledge transfer rather than standard supervised fine-tuning on teacher outputs.
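The details of arcee-ai's pipeline (including the tokenizer surgery and fusion merging) are proprietary, but the core objective of logit-level distillation is standard: minimize the KL divergence between the teacher's and student's next-token distributions. A minimal PyTorch sketch, assuming the vocabularies are already aligned:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 1.0) -> torch.Tensor:
    """KL divergence between teacher and student next-token distributions.

    Illustrative only: assumes teacher and student logits are already
    mapped onto a shared vocabulary (the role "tokenizer surgery" plays
    when distilling across architectures).
    """
    # Soften both distributions with a temperature before comparing.
    student_logp = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_p = F.softmax(teacher_logits / temperature, dim=-1)
    # batchmean matches the mathematical definition of KL divergence;
    # the T^2 factor keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(student_logp, teacher_p,
                    reduction="batchmean") * temperature ** 2
```

In practice this loss is computed per token position and is often mixed with a standard cross-entropy term on ground-truth labels.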
Key Capabilities
- Advanced Reasoning: Excels at technical and scientific queries.
- Complex Code Generation: Optimized for intricate code-generation tasks.
- Mathematical Problem-Solving: Demonstrates strong performance on mathematical tasks.
- Extended Context: Supports a 128k-token context window (see the usage sketch after this list).
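For reference, a minimal inference sketch with Hugging Face transformers, assuming the model is published under the arcee-ai organization as arcee-ai/Virtuoso-Small-v2:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed Hugging Face repository id for this model.
model_id = "arcee-ai/Virtuoso-Small-v2"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",   # use the checkpoint's native precision
    device_map="auto",    # place layers across available devices
)

# Prompts of up to 128k tokens fit within the advertised context window.
messages = [{"role": "user", "content": "Factor x^2 - 5x + 6 and explain each step."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```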
Training Highlights
- Logit-Level Distillation: Trained on approximately 1.1 billion tokens drawn from the Deepseek-v3 logit dataset (see the distillation sketch above).
- Fusion Merging: Employs a specialized merging technique to maximize fidelity to the teacher model (a generic merging sketch follows this list).
- Alignment: Includes a DPO (Direct Preference Optimization) stage to improve alignment and reduce hallucinations (see the DPO sketch below).
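Arcee-ai has not published the "fusion merging" algorithm. As a generic point of reference only, the simplest form of model merging is a weighted average of parameters, as in the linear merges supported by tools such as MergeKit; the sketch below is not the proprietary method:

```python
import torch

def linear_merge(state_dicts: list[dict], weights: list[float]) -> dict:
    """Weighted average of identically-shaped model parameters.

    Generic illustration only; the actual "fusion merging" technique
    used for Virtuoso-Small-v2 is proprietary and may differ entirely.
    """
    assert abs(sum(weights) - 1.0) < 1e-6, "weights should sum to 1"
    merged = {}
    for name in state_dicts[0]:
        merged[name] = sum(
            w * sd[name].float() for w, sd in zip(weights, state_dicts)
        )
    return merged
```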
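The card does not report DPO data or hyperparameters, but the DPO objective itself is standard (Rafailov et al., 2023). A minimal sketch, assuming per-sequence log-probabilities have already been computed:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Direct Preference Optimization loss.

    Each input is the summed log-probability of a full response under
    either the policy being trained or a frozen reference model.
    beta = 0.1 is a common default, not a value reported for this model.
    """
    # Implied reward margin between chosen and rejected responses.
    margins = beta * ((policy_chosen_logps - ref_chosen_logps)
                      - (policy_rejected_logps - ref_rejected_logps))
    # Maximize the likelihood that chosen responses are preferred.
    return -F.logsigmoid(margins).mean()
```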
This model is released under the Apache-2.0 License, allowing for broad commercial and non-commercial use.