mychen76/mistral7b_ocr_to_json_v1

TEXT GENERATION | Concurrency Cost: 1 | Model Size: 7B | Quant: FP8 | Ctx Length: 8k | Published: Oct 5, 2023 | License: apache-2.0 | Architecture: Transformer | 0.1K | Open Weights | Warm

mychen76/mistral7b_ocr_to_json_v1 is a 7 billion parameter language model, fine-tuned from Mistral-7B-v0.1, that converts OCR-extracted text from receipts and invoices into structured JSON objects. It pairs an OCR engine for text recognition with an LLM for structured data generation, with the aim of streamlining the digitization of financial documents. It is tuned for parsing point-of-sale (POS) receipt data. Its base model, Mistral-7B-v0.1, is reported to outperform Llama 2 13B on various benchmarks.


Overview

mychen76/mistral7b_ocr_to_json_v1 is a 7 billion parameter model fine-tuned from Mistral-7B-v0.1. Its core purpose is to transform raw OCR (Optical Character Recognition) output from images, particularly receipts and invoices, into well-formed JSON objects. In other words, it turns unstructured text from scanned documents into a structured, machine-readable format.

Key Capabilities

  • OCR to JSON Conversion: Specializes in taking bounding box and text data from OCR engines and structuring it into a JSON format, ideal for receipt and invoice processing.
  • Receipt Data Extraction: Designed to parse common elements found in POS receipts, such as store names, items, prices, taxes, and payment details.
  • Mistral-7B-v0.1 Base: Built on Mistral-7B-v0.1, which is reported to outperform Llama 2 13B on various benchmarks.
  • Experimental Model: Released as an experimental model with a narrow, specialized focus, so it may still change.
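
As a rough illustration of the OCR-to-JSON input step, the sketch below serializes bounding boxes and recognized text into a single prompt string. The OCR tuple layout, the `PROMPT_TEMPLATE` wording, and the `build_prompt` helper are all assumptions made for this example; the prompt format the model was actually trained on may differ, so check the original model card before relying on it.

```python
import json

# Hypothetical OCR output: (four corner points, (text, confidence)).
# Some OCR engines return a similar shape, but this layout is an assumption.
ocr_boxes = [
    ([[10, 5], [120, 5], [120, 25], [10, 25]], ("ACME MART", 0.98)),
    ([[10, 40], [90, 40], [90, 60], [10, 60]], ("Coffee", 0.95)),
    ([[130, 40], [170, 40], [170, 60], [130, 60]], ("3.50", 0.97)),
]

# Hypothetical instruction template, not taken from the model card.
PROMPT_TEMPLATE = (
    "### Instruction:\n"
    "Convert the following receipt OCR result into a well-formed JSON object. "
    "Do not make up values that are not in the input.\n\n"
    "### Input:\n{boxes}\n\n"
    "### Output:\n"
)


def build_prompt(boxes):
    """Serialize OCR boxes to JSON text and place them in the template."""
    serialized = json.dumps(
        [
            {"box": box, "text": text, "confidence": conf}
            for box, (text, conf) in boxes
        ]
    )
    return PROMPT_TEMPLATE.format(boxes=serialized)


prompt = build_prompt(ocr_boxes)
```

The resulting `prompt` string would then be passed to whatever inference client serves the model.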

Use Cases

This model is well suited to applications that need automated data extraction from financial documents. Developers can integrate it into workflows such as:

  • Receipt Digitization: Converting physical or image-based receipts into structured digital data for accounting, expense tracking, or analytics.
  • Invoice Processing: Automating the extraction of line items, totals, and vendor information from invoices.
  • Data Entry Automation: Reducing manual data entry by transforming OCR results into usable JSON formats.
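
In workflows like these, the generated text usually has to be parsed and checked before it reaches an accounting or analytics system. The sketch below pulls the first JSON object out of a model response using only the standard library. The `extract_json` helper, the sample response, and the field names (`store`, `items`, `total`) are hypothetical; the model's real output schema is not specified here.

```python
import json


def extract_json(generated: str):
    """Return the first JSON object found in the text, or None if none parses."""
    decoder = json.JSONDecoder()
    start = generated.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at index `start`
            # and ignores any trailing text after it.
            obj, _ = decoder.raw_decode(generated, start)
            if isinstance(obj, dict):
                return obj
        except json.JSONDecodeError:
            pass
        start = generated.find("{", start + 1)
    return None


# Hypothetical model response with extra text around the JSON object.
sample_output = (
    'Result: {"store": "ACME MART", '
    '"items": [{"name": "Coffee", "price": 3.5}], "total": 3.5} done'
)

receipt = extract_json(sample_output)
if receipt is None:
    raise ValueError("model output did not contain a JSON object")

# Simple sanity check before handing the data to downstream systems.
items_sum = sum(item["price"] for item in receipt["items"])
totals_match = abs(items_sum - receipt["total"]) < 0.01
```

Returning `None` instead of raising inside the helper leaves the retry or fallback policy to the caller, which is useful when a generation is truncated or malformed.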

Training Data

The model was fine-tuned using the mychen76/invoices-and-receipts_ocr_v1 dataset, which is tailored for OCR-to-JSON conversion tasks.
