stukenov/sozkz-fix-qwen-500m-kk-gec-v3
stukenov/sozkz-fix-qwen-500m-kk-gec-v3 is a 447-million-parameter Qwen-based model developed by Saken Tukenov and fine-tuned for Kazakh grammatical error correction (GEC). It fixes spelling, grammar, punctuation, and word-usage errors in Kazakh text, and it adds a dictionary-based pre/post-processing pipeline that targets character substitution errors. It achieves approximately 93% accuracy on real-world test cases.
SozKZ Fix Qwen 500M — Kazakh GEC v3 Overview
This model, developed by Saken Tukenov, is a specialized 447-million-parameter Qwen-based language model for Kazakh Grammatical Error Correction (GEC). It addresses a wide range of errors in Kazakh text, including spelling (емле), grammar, punctuation, and word usage. It was fine-tuned with LoRA SFT (r=64, alpha=128), and the adapter weights were then merged into the base model.
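As a rough illustration of what merging a LoRA adapter with these hyperparameters means, the standard LoRA formulation adds a scaled low-rank update to each adapted weight matrix. This assumes the usual alpha/r scaling (not a rank-stabilized variant); the symbols W_0, B, and A are generic notation, not names taken from this model card.

```latex
W_{\text{merged}} = W_0 + \frac{\alpha}{r}\, B A,
\qquad B \in \mathbb{R}^{d \times r},\; A \in \mathbb{R}^{r \times k},
\qquad \frac{\alpha}{r} = \frac{128}{64} = 2
```

Under that assumption, the merged checkpoint has the same shape as the base model, so no adapter needs to be loaded at inference time.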
Key Capabilities and Features
- Comprehensive Error Correction: Fixes spelling, grammar, punctuation, and word usage errors in Kazakh.
- Advanced Емле Fixer Pipeline: Incorporates a dictionary-based pre-fixer to convert russified characters to their correct Kazakh equivalents (e.g., "бугін" → "бүгін") and a post-fixer to catch remaining character errors. This pipeline significantly improves correction accuracy.
- Targeted Training Data: Trained on 14,597 examples, including 5,000 additional synthetic pairs specifically focusing on емле and morphology errors.
- Optimized Inference: Uses beam search (num_beams=4) with edit-distance reranking to select the most accurate correction candidate, reaching 93% accuracy on real-world test cases (a substantial improvement over v2's ~60%).
- Small Footprint: With 447M parameters, it offers efficient deployment while maintaining high performance for its specialized task.
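The pre-fixer and the reranking step described in the list above can be sketched in a few lines of plain Python. This is a minimal illustration, not the model's actual pipeline: the `PRE_FIX` table holds only the one example given above (the real dictionary is not reproduced here), the function names are hypothetical, and the reranking criterion (prefer the candidate with the smallest Levenshtein distance to the input) is an assumption about what "edit-distance reranking" means.

```python
import re

# Hypothetical substitution table; only the example from the text is included.
PRE_FIX = {"бугін": "бүгін"}


def pre_fix(text, table=PRE_FIX):
    """Replace whole words found in the table, leaving spacing and punctuation intact."""
    return re.sub(r"\w+", lambda m: table.get(m.group(0), m.group(0)), text)


def edit_distance(a, b):
    """Levenshtein distance computed with the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution or match
        prev = cur
    return prev[-1]


def rerank(source, candidates):
    """Assumed criterion: keep the candidate closest to the source text.

    Ties are resolved in favor of the earlier (higher-ranked) beam.
    """
    return min(candidates, key=lambda c: edit_distance(source, c))
```

A conservative criterion like this favors corrections that change as little of the input as possible, which can help guard against over-editing; the actual reranker may weigh candidates differently.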
When to Use This Model
This model is suited to applications that need high-accuracy grammatical and spelling correction for Kazakh text. Its pre/post-processing pipeline makes it particularly effective at handling common Kazakh character substitution errors. It can be integrated into text editors, communication platforms, or any system where clean, grammatically correct Kazakh output is critical. For best output quality, set repetition_penalty=1.0 at inference time.
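The inference settings mentioned in this card (num_beams=4, repetition_penalty=1.0) might be wired into Hugging Face `transformers` roughly as follows. This is a sketch under several assumptions: that the merged checkpoint loads with `AutoModelForCausalLM`, that the raw text can be passed as the prompt (the expected prompt format is not stated here), and that `max_new_tokens=128` and returning all four beams are reasonable choices. The helper name `generate_candidates` is hypothetical.

```python
# Settings taken from the card: num_beams=4 and repetition_penalty=1.0.
# The remaining values are assumptions made for this sketch.
GENERATION_KWARGS = {
    "num_beams": 4,
    "num_return_sequences": 4,  # assumed: return every beam so they can be reranked
    "repetition_penalty": 1.0,
    "do_sample": False,
    "max_new_tokens": 128,      # assumed length budget
}


def generate_candidates(text, model_id="stukenov/sozkz-fix-qwen-500m-kk-gec-v3"):
    """Return the decoded beam candidates for one input text."""
    # Imported lazily so the settings above can be inspected without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)

    # Assumption: the plain text is used as the prompt; the real template may differ.
    inputs = tokenizer(text, return_tensors="pt")
    outputs = model.generate(**inputs, **GENERATION_KWARGS)

    # Decoder-only models echo the prompt, so strip it before decoding.
    prompt_len = inputs["input_ids"].shape[1]
    return [tokenizer.decode(seq[prompt_len:], skip_special_tokens=True)
            for seq in outputs]
```

Keeping `repetition_penalty` at 1.0 disables the penalty entirely, which is a sensible default for correction tasks where the output is expected to repeat most of the input tokens.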