SozKZ Fix Qwen 500M — Kazakh GEC v3 Overview

This model, developed by Saken Tukenov, is a specialized 447-million-parameter Qwen-based language model for Kazakh Grammatical Error Correction (GEC). It corrects a wide range of errors in Kazakh text, including spelling (емле), grammar, punctuation, and word usage. The model was fine-tuned with LoRA SFT (r=64, alpha=128), and the resulting adapter was merged into the base model.
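As a rough illustration of what merging a LoRA adapter means, the sketch below folds a low-rank update into a base weight matrix using the conventional LoRA scaling factor alpha / r, which is 128 / 64 = 2 for the values quoted above. The toy matrices and the helper names (`matmul`, `merge_lora`) are assumptions made for this example; the model's actual training and merge code is not shown in this overview.

```python
# Toy, pure-Python illustration of merging a LoRA update into a base weight.
# Assumes the conventional LoRA scaling alpha / r; shapes are tiny on purpose.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def merge_lora(w, lora_b, lora_a, r, alpha):
    """Return W + (alpha / r) * (B @ A), the merged weight."""
    scale = alpha / r
    delta = matmul(lora_b, lora_a)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]

R, ALPHA = 64, 128      # values quoted in the model description
SCALE = ALPHA / R       # scaling applied to the low-rank update

# Hypothetical 2x2 base weight with rank-1 factors (the real adapter uses
# rank 64); the scaling factor is still taken from R and ALPHA above.
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0], [2.0]]      # shape (2, 1)
A = [[0.5, -0.5]]       # shape (1, 2)

merged = merge_lora(W, B, A, r=R, alpha=ALPHA)
```

After merging, inference needs only the single merged matrix, so the adapter adds no extra parameters or latency at run time.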

Key Capabilities and Features

Comprehensive Error Correction: Fixes spelling, grammar, punctuation, and word usage errors in Kazakh.
Advanced Емле Fixer Pipeline: Combines a dictionary-based pre-fixer, which converts russified characters to their correct Kazakh equivalents (e.g., "бугін" → "бүгін"), with a post-fixer that catches any character errors remaining after the model has run. This pipeline improves correction accuracy.
Targeted Training Data: Trained on 14,597 examples, including 5,000 additional synthetic pairs that target емле and morphology errors.
Optimized Inference: Uses beam search (num_beams=4) with edit-distance reranking to select the correction candidate. The model reaches 93% accuracy on real-world test cases, up from roughly 60% for v2.
Small Footprint: At 447M parameters, the model is small enough for efficient deployment while remaining accurate on its specialized task.
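The емле fixer and the reranking step described in the list above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the model's actual pipeline: `EMLE_FIXES`, `fix_emle`, `levenshtein`, and `rerank` are hypothetical names, the dictionary holds only the one pair quoted above, and the reranking criterion (prefer the beam candidate closest to the input, to discourage over-editing) is an assumption, since this overview does not spell it out.

```python
import re

# Hypothetical dictionary of russified spellings and their Kazakh forms.
# Only the pair quoted in the overview is included here.
EMLE_FIXES = {"бугін": "бүгін"}

def fix_emle(text, fixes=EMLE_FIXES):
    """Replace whole words found in the dictionary, keeping initial capitals."""
    def repl(match):
        word = match.group(0)
        fixed = fixes.get(word.lower())
        if fixed is None:
            return word
        return fixed.capitalize() if word[0].isupper() else fixed
    # \w+ matches Unicode word characters, so Cyrillic words are covered.
    return re.sub(r"\w+", repl, text)

def levenshtein(a, b):
    """Character-level edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def rerank(source, candidates):
    """Assumed criterion: pick the candidate closest to the source text.

    min() keeps the earliest candidate on ties, i.e. the higher-ranked beam.
    """
    return min(candidates, key=lambda c: levenshtein(source, c))
```

In a full pipeline, `fix_emle` would run on the raw input before generation and again on the selected output, with `rerank` applied to the beam candidates in between.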

When to Use This Model

This model suits applications that need high-accuracy grammar and spelling correction for Kazakh text. Its pre/post-processing pipeline makes it particularly effective against common Kazakh character-substitution errors. It can be integrated into text editors, communication platforms, or any system where clean, grammatically correct Kazakh output matters. For best output quality, set repetition_penalty=1.0 at inference (that is, no repetition penalty); this setting is critical.
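The overview does not say why repetition_penalty=1.0 matters. One plausible reason, stated here as an assumption, is that a correction model must copy most input tokens unchanged, and a penalty above 1.0 lowers the score of every token that has already appeared. The sketch below applies the commonly used penalty rule (divide positive logits, multiply negative ones) to a toy logit vector; `apply_repetition_penalty` and `GENERATION_SETTINGS` are hypothetical names, although the dictionary keys match the keyword arguments accepted by Hugging Face `generate`.

```python
# Inference settings quoted in this overview.
GENERATION_SETTINGS = {"num_beams": 4, "repetition_penalty": 1.0}

def apply_repetition_penalty(logits, seen_token_ids, penalty):
    """Lower the score of every token id that has already appeared.

    Positive logits are divided by the penalty and negative ones are
    multiplied, so any penalty above 1.0 always reduces the score.
    """
    out = list(logits)
    for tok in set(seen_token_ids):
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out

# Toy vocabulary of three tokens; tokens 0 and 1 already occur in the input.
logits = [2.0, -1.0, 0.5]
seen = [0, 1]

neutral = apply_repetition_penalty(logits, seen, GENERATION_SETTINGS["repetition_penalty"])
penalized = apply_repetition_penalty(logits, seen, 1.2)
```

With a penalty of 1.0 the logits pass through unchanged, so tokens copied from the source sentence compete on equal terms with new ones.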

Full Model Card (README)