shibing624/chinese-text-correction-7b
The shibing624/chinese-text-correction-7b model is a 7.6 billion parameter instruction-tuned causal language model developed by shibing624, based on Qwen/Qwen2.5-7B-Instruct. It is specifically fine-tuned for Chinese text correction, excelling at both spelling and grammar errors, including those involving length-aligned and length-unaligned corrections. With a context length of 131072 tokens, this model is optimized for high-accuracy Chinese text correction tasks.
Loading preview...
Overview
shibing624/chinese-text-correction-7b is a 7.6 billion parameter model built upon the Qwen/Qwen2.5-7B-Instruct architecture, specifically fine-tuned for Chinese text correction (CTC). This model addresses both Chinese Spelling Correction (CSC), which handles sound-alike, shape-alike, and grammar errors with length-aligned corrections, and broader CTC, which also includes multi-character or missing-character errors that are length-unaligned.
Key Capabilities
- Comprehensive Chinese Text Correction: Corrects spelling, grammar, and structural errors in Chinese text.
- Handles Varied Error Types: Capable of correcting both length-aligned errors (e.g., homophones, similar-looking characters) and length-unaligned errors (e.g., missing or extra characters).
- Strong Performance: Achieves an average F1 score of 0.8225 across various benchmarks, notably 0.9798 on EC-LAW and 0.9959 on MCSC.
- Integration: Designed to be used with the
pycorrectorlibrary for easy integration into correction workflows, or directly via Hugging Face Transformers.
Training Details
- Base Model: Qwen/Qwen2.5-7B-Instruct.
- Dataset: Trained on the shibing624/chinese_text_correction dataset.
- Parameters: Trained for 8 epochs with a batch size of 2 over 36,000 steps.
Good for
- Applications requiring high-accuracy Chinese spelling and grammar correction.
- Developers looking for a specialized model to improve the quality of Chinese text input or generated content.
- Integration into larger NLP pipelines for pre-processing or post-processing Chinese text.
Top 3 parameter combinations used by Featherless users for this model. Click a tab to see each config.