lang-uk/OmniGEC-Minimal-12B
The lang-uk/OmniGEC-Minimal-12B model is a 12 billion parameter language model based on the Gemma-3-12B-IT architecture, developed by lang-uk. It is instruction-tuned and supervised fine-tuned on the OmniGEC corpus, specializing in grammatical error correction (GEC) for 11 low- and mid-resource European languages. This model achieves state-of-the-art results for paragraph-based editing in minimal and fluency tracks, making it suitable for high-quality, multilingual text correction.
Loading preview...
OmniGEC-Minimal-12B: Multilingual Grammatical Error Correction
OmniGEC-Minimal-12B is a 12 billion parameter model from lang-uk, built upon the Gemma-3-12B-IT architecture. It has been extensively instruction-tuned and supervised fine-tuned using the OmniGEC corpus, a silver-standard GEC dataset. This corpus integrates MultiGEC-25, Wikipedia, and Reddit edits across 11 European languages, including Czech, English, Estonian, German, Greek, Italian, Latvian, Slovenian, Swedish, and Ukrainian.
Key Capabilities
- Paragraph-level Correction: Excels at correcting grammatical errors within entire paragraphs, not just individual sentences.
- Multilingual Support: Provides robust GEC capabilities for 11 low- and mid-resource European languages.
- State-of-the-Art Performance: Achieves SOTA results for paragraph-based editing in both minimal and fluency tracks, surpassing baseline models like LLaMA-3-8B by 9–10 GLEU points on the minimal track.
- Enhanced for Specific Languages: Delivers the current best open scores for Estonian and Latvian on the MultiGEC-25 test set.
Training and Evaluation
The model was trained on a diverse dataset including WikiEdits-MultiGEC (human Wikipedia revisions), Reddit-MultiGEC (posts from language-specific subreddits with GPT-4o-mini corrections), and MultiGEC-25 golden shared-task data. Evaluation was performed using the GLEU metric via the official MultiGEC-25 CodaLab evaluator.
Good For
- Automated Text Correction: Ideal for applications requiring high-quality grammatical error correction in multiple European languages.
- Content Refinement: Useful for improving the fluency and correctness of written content in supported languages.
- Research in GEC: Provides a strong baseline and SOTA performance for further research in multilingual GEC, particularly for low-resource languages.