tartuNLP/Llammas-base-p1-GPT-4o-human-error-mix-paragraph-GEC
Text Generation | Concurrency Cost: 1 | Model Size: 7B | Quant: FP8 | Ctx Length: 4k | Published: Feb 11, 2025 | License: llama2 | Architecture: Transformer | Open Weights | Warm

The tartuNLP/Llammas-base-p1-GPT-4o-human-error-mix-paragraph-GEC model, developed by Vainikko et al., is a grammatical error correction (GEC) model for paragraph-level input. It comes from a case study on Estonian language learners in which proprietary large language models generate synthetic training data to offset the scarcity of error correction data for Estonian. The model corrects whole paragraphs rather than isolated sentences. Both the weights and the synthetic training data are released openly, so the model can be used for research and applications in Estonian language processing.


Overview

This model, developed by tartuNLP (Vainikko et al.), is a grammatical error correction (GEC) model that takes an entire paragraph as input and returns a corrected version. It was created as part of a case study on Estonian language learners, a setting with little available error correction data and no explanation data at all. The approach has two steps: proprietary large language models generate synthetic training data, and that data is then used to train task-specific GEC models.

Key Capabilities

  • Paragraph-level Error Correction: Unlike many GEC models that operate sentence by sentence, this model takes a whole paragraph as input, so a correction can draw on context from neighbouring sentences.
  • Estonian Language Focus: Trained specifically for grammatical error correction in Estonian, which suits applications for Estonian language users and learners.
  • Synthetic Data Utilization: Trained on synthetic data generated by proprietary LLMs, which compensates for the scarcity of error correction data for Estonian.
  • Open-Weight Release: The model is released with open weights, promoting transparency and enabling further research and development by the community.
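
The paragraph-level behaviour described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: `split_paragraphs`, `correct_document`, and the `correct_paragraph` callable are hypothetical names, and the callable stands in for whatever code actually invokes the model. The point is only that each paragraph is passed to the corrector whole, instead of being split into sentences first.

```python
import re
from typing import Callable, List


def split_paragraphs(text: str) -> List[str]:
    """Split a document on blank lines, keeping each paragraph in one piece."""
    blocks = re.split(r"\n\s*\n", text.replace("\r\n", "\n"))
    # Collapse line breaks inside a paragraph into single spaces.
    return [" ".join(block.split()) for block in blocks if block.strip()]


def correct_document(text: str, correct_paragraph: Callable[[str], str]) -> str:
    """Run a paragraph-level corrector over every paragraph and rejoin them.

    `correct_paragraph` is a hypothetical callable (assumption): it receives
    one whole paragraph and returns the corrected paragraph.
    """
    return "\n\n".join(correct_paragraph(p) for p in split_paragraphs(text))
```

Any function that maps a paragraph string to a corrected paragraph string can be passed as `correct_paragraph`, which keeps the splitting logic independent of how the model is served.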

Good For

  • Grammatical error correction of Estonian text, particularly for longer passages or paragraphs.
  • Applications aimed at assisting Estonian language learners.
  • Research into GEC methods, especially those involving synthetic data generation and paragraph-level processing.
  • Developers seeking an open-weight solution for Estonian GEC that uses paragraph-level context.
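
For local experimentation, the open weights could be loaded with the Hugging Face `transformers` library roughly as sketched below. Several things here are assumptions rather than documented facts: the prompt template in `build_prompt` is a hypothetical placeholder (the format the model was actually trained with should be taken from the upstream model card), and `build_prompt`, `correct_paragraph`, and the `max_new_tokens` default are illustrative names and values. Running it downloads the full 7B checkpoint.

```python
MODEL_ID = "tartuNLP/Llammas-base-p1-GPT-4o-human-error-mix-paragraph-GEC"


def build_prompt(paragraph: str) -> str:
    """Wrap a paragraph in a prompt.

    ASSUMPTION: this template is a placeholder, not the model's documented
    prompt format. Replace it with the format from the upstream model card.
    """
    return f"Input paragraph:\n{paragraph}\n\nCorrected paragraph:\n"


def correct_paragraph(paragraph: str, max_new_tokens: int = 512) -> str:
    """Generate a correction for one paragraph (loads the model on each call)."""
    # Imported lazily so that build_prompt can be used without transformers.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

    inputs = tokenizer(build_prompt(paragraph), return_tensors="pt").to(model.device)
    # Greedy decoding keeps the output deterministic for a correction task.
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)

    # Drop the prompt tokens and decode only the newly generated continuation.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()
```

In real use the tokenizer and model would be loaded once and reused across calls. The paragraph and the generated correction together also need to fit within the 4k context length listed for this deployment.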