PleIAs/OCRonos

Public
8B
FP8
8192
Jun 1, 2024
License: apache-2.0
Hugging Face
Overview

OCRonos: Specialized Model for Badly Digitized Text Correction

OCRonos is a series of specialized language models developed by PleIAs, primarily based on the Llama-3-8B architecture, for correcting badly digitized text. As part of the Bad Data Toolbox, the models correct OCR errors, fix incorrect word segmentation (words wrongly cut or merged), and restore broken text structure.
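A minimal sketch of how a correction call might be wired up with the Hugging Face `transformers` library. The model ID `PleIAs/OCRonos` is taken from this page; the prompt template, the generation settings, and the helper names `build_prompt` and `correct` are assumptions made for illustration, not a documented interface, so the expected input format should be checked against the model card before use.

```python
# Assumed prompt template; the real expected format may differ.
PROMPT_TEMPLATE = "### Text ###\n{text}\n\n### Correction ###\n"


def build_prompt(ocr_text: str) -> str:
    """Wrap raw OCR output in the (assumed) instruction template."""
    return PROMPT_TEMPLATE.format(text=ocr_text.strip())


def correct(ocr_text: str, model_id: str = "PleIAs/OCRonos",
            max_new_tokens: int = 512) -> str:
    """Hypothetical helper: run one greedy correction pass over ocr_text."""
    # Imported lazily so build_prompt works without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)

    inputs = tokenizer(build_prompt(ocr_text), return_tensors="pt")
    # Greedy decoding keeps the output deterministic for a correction task.
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens,
                                do_sample=False)

    # Keep only the newly generated tokens, dropping the echoed prompt.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()
```

Calling `correct("Tbe qnick brovvn fox")` would download the 8B weights on first use, so a GPU with enough memory is assumed.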

Key Capabilities & Features

  • OCR Error Correction: Specifically trained to rectify inaccuracies introduced during the optical character recognition process.
  • Text Structure Restoration: Addresses issues like wrong word cuts/merges and general structural damage in digitized documents.
  • Diverse Training Data: Trained on a wide range of OCR-processed texts from PleIAs' open pre-training corpus, including cultural heritage sources (Common Corpus) and financial/administrative documents (Finance Commons).
  • Language Switching Mitigation: Reduces language switching, where a model drifts into another language when given noisy input, a problem observed in smaller generalist models.
  • Faithful Correction: Generally stays close to the original material, restoring deteriorated text sensibly without rewriting words that are already correct.
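The faithfulness property can be spot-checked after the fact. The sketch below is not part of OCRonos; it is a generic heuristic using Python's standard `difflib` to estimate how much of a text changed at the word level. The function names and the `max_changed` threshold are assumptions for illustration.

```python
from difflib import SequenceMatcher


def changed_word_ratio(original: str, corrected: str) -> float:
    """Share of words, counted across both texts, outside any matching block."""
    a, b = original.split(), corrected.split()
    if not a and not b:
        return 0.0
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return 1.0 - (2.0 * matched) / (len(a) + len(b))


def looks_faithful(original: str, corrected: str,
                   max_changed: float = 0.5) -> bool:
    """Flag outputs that changed more words than the (arbitrary) threshold."""
    return changed_word_ratio(original, corrected) <= max_changed


# One misread word out of four: a quarter of the words differ.
print(changed_word_ratio("Tbe quick brown fox", "The quick brown fox"))  # 0.25
```

Because fixing wrongly cut or merged words changes word boundaries, heavily damaged input will legitimately score high, so the threshold is best tuned per corpus and used to select outputs for manual review, not to reject them automatically.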

Ideal Use Cases

  • Making Challenging Resources Usable: Suited to cases where the original PDF sources are too damaged to be re-OCRed accurately or are difficult to retrieve.
  • LLM Applications & Search Retrieval: Enhances the usability of deteriorated text for downstream LLM tasks and improves search accuracy.
  • Historical & Financial Document Processing: Particularly effective for cultural heritage texts and financial/administrative documents due to its specialized training data.
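Long historical or administrative documents will usually exceed what fits in a single request, so they need to be split before correction. The sketch below packs paragraphs into chunks under a character budget and runs a corrector over each one. It assumes the 8192 figure listed for this model is its context length in tokens; the `max_chars` default is a crude, conservative stand-in for a real token count, and `corrector` is a hypothetical callable wrapping whatever inference call is in use.

```python
from typing import Callable


def chunk_paragraphs(text: str, max_chars: int = 4000) -> list[str]:
    """Greedily pack blank-line-separated paragraphs into chunks."""
    chunks: list[str] = []
    current = ""
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        candidate = f"{current}\n\n{para}" if current else para
        # A single oversized paragraph is kept whole rather than cut mid-word.
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks


def correct_document(text: str, corrector: Callable[[str], str],
                     max_chars: int = 4000) -> str:
    """Correct each chunk independently and rejoin with blank lines."""
    return "\n\n".join(corrector(c) for c in chunk_paragraphs(text, max_chars))
```

Splitting on paragraph boundaries keeps each chunk self-contained, which matters because the model sees no context from neighbouring chunks. For tighter packing, the character budget could be replaced with a count from the model's tokenizer.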

OCRonos aims to turn highly deteriorated content into usable text and to remain a precise correction tool even on severely damaged input.