Overview
openfoodfacts/spellcheck-mistral-7b is a 7-billion-parameter language model developed by Open Food Facts and fine-tuned from the Mistral-7B-v0.3 base model. Its primary purpose is to correct spelling errors and typos in food ingredient lists, especially lists extracted by OCR from photos of product labels. It targets the data quality problems caused by poor source images, such as low-resolution pictures or curved packaging, which often make ingredient extraction inaccurate.
Key Capabilities
- Ingredient Spellchecking: Corrects typos in ingredient lists according to a defined correction guideline.
- OCR Error Mitigation: Designed to improve the accuracy of ingredient data initially extracted via OCR.
- Custom Training: Fine-tuned on thousands of corrected ingredient lists from the Open Food Facts database.
Performance Metrics
The model is evaluated with a custom sequence alignment algorithm that measures precision and recall for both correction and localization. Current metrics:
- Correction precision: 0.67
- Correction recall: 0.62
- Localization precision: 0.75
- Localization recall: 0.69
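To make the four metrics above more concrete, here is a minimal sketch of how alignment-based scores of this kind could be computed. It is not the project's evaluation code: it swaps in Python's standard `difflib.SequenceMatcher` for the custom alignment algorithm, and it assumes that "localization" means editing the same span as a reference correction while "correction" additionally requires the same replacement text. The function names `edits`, `precision_recall`, and `score`, and the sample ingredient strings, are hypothetical.

```python
from difflib import SequenceMatcher


def edits(original, revised):
    """Map each changed span (start, end) of `original` to its replacement tokens."""
    matcher = SequenceMatcher(a=original, b=revised, autojunk=False)
    return {
        (i1, i2): tuple(revised[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    }


def precision_recall(true_positives, n_predicted, n_reference):
    """Return (precision, recall), using 0.0 when a denominator is zero."""
    precision = true_positives / n_predicted if n_predicted else 0.0
    recall = true_positives / n_reference if n_reference else 0.0
    return precision, recall


def score(original, reference, prediction):
    """Score one prediction against one reference (assumed metric definitions)."""
    o, r, p = original.split(), reference.split(), prediction.split()
    ref_edits, pred_edits = edits(o, r), edits(o, p)
    # Localization: the model edited exactly the span the reference edited.
    located = ref_edits.keys() & pred_edits.keys()
    # Correction: same span and the same replacement tokens.
    corrected = {span for span in located if ref_edits[span] == pred_edits[span]}
    return {
        "localization": precision_recall(len(located), len(pred_edits), len(ref_edits)),
        "correction": precision_recall(len(corrected), len(pred_edits), len(ref_edits)),
    }


# Made-up example: two typos in the OCR text, one of them fixed incorrectly.
result = score(
    original="farine de blé, sucr, huile de plame, sel",
    reference="farine de blé, sucre, huile de palme, sel",
    prediction="farine de blé, sucre, huile de palm, sel",
)
```

In this made-up example the prediction edits both faulty tokens but repairs only one of them correctly, so localization precision and recall are both 1.0 while correction precision and recall are both 0.5. Under these assumed definitions, correction scores can never exceed localization scores, which matches the ordering of the reported figures.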
Usage
For reliable inference, the ingredient list must be embedded in the specific prompt format shown in the model's documentation. The training dataset and the evaluation benchmark are publicly available on Hugging Face, and a demo is hosted on Hugging Face Spaces.
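As a rough illustration of embedding an ingredient list into a prompt, the sketch below wraps the list in a template and runs it through the Hugging Face `transformers` text-generation pipeline. The template string in `PROMPT_TEMPLATE` is an assumption and should be checked against the model card before use; `build_prompt` and `correct` are hypothetical helper names, and the generation settings are illustrative defaults rather than recommended values.

```python
# ASSUMPTION: this template is illustrative; copy the exact format from the model card.
PROMPT_TEMPLATE = "###Correct the list of ingredients:\n{ingredients}\n\n###Correction:\n"


def build_prompt(ingredients: str) -> str:
    """Embed a raw ingredient list into the (assumed) prompt template."""
    return PROMPT_TEMPLATE.format(ingredients=ingredients.strip())


def correct(
    ingredients: str,
    model_id: str = "openfoodfacts/spellcheck-mistral-7b",
    max_new_tokens: int = 512,
) -> str:
    """Generate a corrected ingredient list; downloads the model on first call."""
    # Imported lazily so build_prompt works without transformers installed.
    from transformers import pipeline

    generator = pipeline("text-generation", model=model_id)
    outputs = generator(
        build_prompt(ingredients),
        max_new_tokens=max_new_tokens,
        do_sample=False,          # greedy decoding for deterministic output
        return_full_text=False,   # return only the text generated after the prompt
    )
    return outputs[0]["generated_text"].strip()


prompt = build_prompt("  farine de blé, sucr, huile de plame, sel  ")
```

Calling `correct(...)` loads a 7-billion-parameter model, so in practice it needs a GPU with enough memory or a quantized variant. `build_prompt` can be tested on its own without any model download.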