inspirebek/qwen3-4b-uzbek-v2
inspirebek/qwen3-4b-uzbek-v2 is a 4-billion-parameter Qwen3-based language model fine-tuned by inspirebek and optimized for the Uzbek language. It has a 32,768-token context length and is built as an Uzbek-first chat assistant, showing significant improvements in Uzbek language understanding and generation over its base model. The model targets applications requiring strong performance in Uzbek while retaining English capabilities.
Overview
inspirebek/qwen3-4b-uzbek-v2 is a 4 billion parameter Qwen3-based language model, specifically fine-tuned for the Uzbek language. This model addresses the challenge of adapting English-dominant base models to new languages by expanding the LoRA configuration to include embed_tokens and lm_head, which are crucial for re-mapping the vocabulary. This approach significantly improved performance on Uzbek benchmarks, with MMLU-uz jumping to 40.50% from a near-random baseline.
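An adapter configuration of the kind described above might look roughly like the following. This is an illustrative sketch using the `peft` library; only `r=64`, `alpha=128`, `use_rslora=True`, and the inclusion of `embed_tokens` and `lm_head` come from the model card, while the remaining target module names are typical for Qwen-style architectures and are assumptions.

```python
# Hypothetical LoRA setup that also adapts the embedding and output
# layers for vocabulary re-mapping. Hyperparameters beyond
# r=64 / alpha=128 / use_rslora=True are illustrative, not the
# author's exact configuration.
from peft import LoraConfig

lora_config = LoraConfig(
    r=64,
    lora_alpha=128,
    use_rslora=True,  # rank-stabilized LoRA scaling
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",  # attention (assumed)
        "gate_proj", "up_proj", "down_proj",      # MLP (assumed)
        "embed_tokens", "lm_head",                # vocabulary re-mapping
    ],
    task_type="CAUSAL_LM",
)
```

Adapting `embed_tokens` and `lm_head` is what lets the model re-map its vocabulary toward Uzbek rather than only steering the attention and MLP layers.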
Key Capabilities
- Uzbek Language Proficiency: Achieves 40.50% on MMLU-uz and 33.42% on UzLib, demonstrating strong understanding and generation in Uzbek.
- Dual-Stage Fine-tuning: Utilizes a two-stage LoRA fine-tuning process, including continued pretraining on native Uzbek text and supervised fine-tuning on chat-formatted Uzbek instructions.
- Efficient Training: Employs `unsloth` and `peft` with specific LoRA configurations (`r=64`, `alpha=128`, `use_rslora=True`) and a dual learning rate strategy to optimize training within compute constraints.
- Robustness: Features a `TrainerCallback` for pushing checkpoints to Hugging Face, enabling seamless resumption of training after compute timeouts.
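A checkpoint-pushing callback of the kind listed above could be sketched as follows. This is a minimal illustration assuming the Hugging Face `transformers` Trainer API; the class name and repo id are placeholders, not the author's actual implementation.

```python
from transformers import TrainerCallback


class PushCheckpointCallback(TrainerCallback):
    """Push each saved checkpoint to the Hub so training can resume
    after a compute timeout. The repo id is a placeholder."""

    def __init__(self, repo_id="your-username/your-checkpoint-repo"):
        self.repo_id = repo_id

    def on_save(self, args, state, control, **kwargs):
        # The Trainer passes the model among the event kwargs.
        model = kwargs.get("model")
        if model is not None:
            model.push_to_hub(
                self.repo_id,
                commit_message=f"checkpoint at step {state.global_step}",
            )
        return control
```

Passing an instance via `Trainer(..., callbacks=[PushCheckpointCallback("user/repo")])` would then upload every saved checkpoint, so an interrupted run can be resumed from the Hub copy.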
Intended Use Cases
- Uzbek-first Chat Assistants: Designed primarily for conversational AI applications in Uzbek.
- Multilingual Applications: Capable in English as well, making it suitable for scenarios requiring both Uzbek and English language support.
- Research and Development: Serves as a research artifact for exploring language model adaptation to low-resource languages. Users should be aware of the CC-BY-NC-4.0 license on some training data, which restricts commercial use of derivative models unless those subsets are excluded.
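The dual learning rate strategy mentioned under Key Capabilities can be sketched with optimizer parameter groups: a lower rate for the large embedding/output matrices and a higher rate for the lightweight adapters. The rates and the toy module below are illustrative assumptions; the model card does not state the author's actual values.

```python
import torch
import torch.nn as nn

# Toy stand-in for a model with embedding/head weights and adapter weights.
model = nn.ModuleDict({
    "embed_tokens": nn.Embedding(100, 16),
    "lora_adapter": nn.Linear(16, 16),
})

# Illustrative dual learning rates: conservative for embeddings,
# more aggressive for the small LoRA adapter matrices.
optimizer = torch.optim.AdamW([
    {"params": model["embed_tokens"].parameters(), "lr": 1e-5},
    {"params": model["lora_adapter"].parameters(), "lr": 2e-4},
])
```

Splitting the parameters this way lets the re-mapped vocabulary layers move slowly (avoiding catastrophic drift) while the adapters learn the new language quickly.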