shibing624/chinese-text-correction-7b

Hugging Face
TEXT GENERATIONConcurrency Cost:1Model Size:7.6BQuant:FP8Ctx Length:32kPublished:Oct 12, 2024License:apache-2.0Architecture:Transformer0.0K Open Weights Warm

The shibing624/chinese-text-correction-7b model is a 7.6 billion parameter instruction-tuned causal language model developed by shibing624, based on Qwen/Qwen2.5-7B-Instruct. It is specifically fine-tuned for Chinese text correction, excelling at both spelling and grammar errors, including those involving length-aligned and length-unaligned corrections. With a context length of 131072 tokens, this model is optimized for high-accuracy Chinese text correction tasks.

Loading preview...

Overview

shibing624/chinese-text-correction-7b is a 7.6 billion parameter model built upon the Qwen/Qwen2.5-7B-Instruct architecture, specifically fine-tuned for Chinese text correction (CTC). This model addresses both Chinese Spelling Correction (CSC), which handles sound-alike, shape-alike, and grammar errors with length-aligned corrections, and broader CTC, which also includes multi-character or missing-character errors that are length-unaligned.

Key Capabilities

  • Comprehensive Chinese Text Correction: Corrects spelling, grammar, and structural errors in Chinese text.
  • Handles Varied Error Types: Capable of correcting both length-aligned errors (e.g., homophones, similar-looking characters) and length-unaligned errors (e.g., missing or extra characters).
  • Strong Performance: Achieves an average F1 score of 0.8225 across various benchmarks, notably 0.9798 on EC-LAW and 0.9959 on MCSC.
  • Integration: Designed to be used with the pycorrector library for easy integration into correction workflows, or directly via Hugging Face Transformers.

Training Details

  • Base Model: Qwen/Qwen2.5-7B-Instruct.
  • Dataset: Trained on the shibing624/chinese_text_correction dataset.
  • Parameters: Trained for 8 epochs with a batch size of 2 over 36,000 steps.

Good for

  • Applications requiring high-accuracy Chinese spelling and grammar correction.
  • Developers looking for a specialized model to improve the quality of Chinese text input or generated content.
  • Integration into larger NLP pipelines for pre-processing or post-processing Chinese text.

Popular Sampler Settings

Top 3 parameter combinations used by Featherless users for this model. Click a tab to see each config.

temperature
top_p
top_k
frequency_penalty
presence_penalty
repetition_penalty
min_p