MayankLad31/invoice_schema

Task: Text Generation | Concurrency Cost: 1 | Model Size: 1.5B | Quant: BF16 | Ctx Length: 32k | Published: May 3, 2025 | License: apache-2.0 | Architecture: Transformer | Open Weights

MayankLad31/invoice_schema is a fine-tuned Qwen2.5-0.8B model optimized for extracting structured JSON data from OCR text based on user-defined schemas. This model specializes in invoice processing, enabling users to define custom JSON structures for fields like dates, invoice IDs, and line items. It is designed for local deployment and integration with OCR tools to automate data extraction from documents.

Overview

MayankLad31/invoice_schema is a specialized model, fine-tuned from Qwen2.5-0.8B, for structured data extraction from the text of scanned or photographed documents. It takes OCR (Optical Character Recognition) output together with a user-defined JSON schema and returns the extracted values as JSON that follows that schema.
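To make the idea of a user-defined schema concrete, here is a minimal sketch of how a schema and OCR text might be combined into a single prompt. The field names in `INVOICE_SCHEMA`, the helper `build_prompt`, and the prompt wording are all hypothetical; the model card does not document an exact prompt format, so treat this as one plausible arrangement rather than the author's.

```python
import json

# Hypothetical schema: the field names and empty placeholder values are
# illustrative only and are not taken from the model card.
INVOICE_SCHEMA = {
    "invoice_id": "",
    "invoice_date": "",
    "bill_to": "",
    "line_items": [{"description": "", "quantity": "", "amount": ""}],
}


def build_prompt(schema: dict, ocr_text: str) -> str:
    """Combine the target schema and the OCR text into one prompt string.

    The instruction wording is an assumption, not a documented format.
    """
    schema_json = json.dumps(schema, indent=2)
    return (
        "Extract the following fields from the OCR text and reply with JSON only.\n"
        f"Schema:\n{schema_json}\n\n"
        f"OCR text:\n{ocr_text}"
    )


prompt = build_prompt(INVOICE_SCHEMA, "INVOICE #A-1001\nDate: 2025-05-03")
```

Keeping the schema as an ordinary dictionary means the same helper can be reused for receipts or purchase orders by swapping in a different set of keys.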

Key Capabilities

  • Schema-driven Extraction: Users can specify any JSON schema, and the model will attempt to extract corresponding data from the input text.
  • Invoice Processing: Specifically demonstrated for extracting details like invoice dates, IDs, billing information, and itemized lists from invoice images.
  • Local Deployment: Provided in GGUF format (qwen_finetune.Q8_0.gguf and qwen_finetune.F16-mmproj.gguf) for efficient local execution on CPU.
  • Integration with OCR: Designed to work in conjunction with OCR tools (like PaddleOCR) to convert image-based documents into text for processing.
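The capabilities above suggest a pipeline of OCR lines in, schema-shaped JSON out. The sketch below shows that flow under stated assumptions: `generate` is a hypothetical callable standing in for whatever runtime loads the GGUF file, `fake_generate` is a stub with a canned reply so the example runs without a model, and the prompt layout is illustrative rather than documented.

```python
import json
from typing import Callable, List


def parse_json_reply(reply: str) -> dict:
    """Parse the text between the first '{' and the last '}' as JSON."""
    start = reply.find("{")
    end = reply.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object in reply")
    return json.loads(reply[start:end + 1])


def missing_keys(schema: dict, data: dict) -> List[str]:
    """Return the top-level schema keys that are absent from the data."""
    return [key for key in schema if key not in data]


def extract(ocr_lines: List[str], schema: dict,
            generate: Callable[[str], str]) -> dict:
    """Join OCR lines, query the model, and check the reply against the schema."""
    ocr_text = "\n".join(ocr_lines)
    # Assumed prompt layout; adjust to whatever the model was trained on.
    prompt = f"Schema:\n{json.dumps(schema)}\n\nOCR text:\n{ocr_text}"
    data = parse_json_reply(generate(prompt))
    absent = missing_keys(schema, data)
    if absent:
        raise ValueError(f"reply is missing keys: {absent}")
    return data


def fake_generate(prompt: str) -> str:
    """Stub standing in for a real model call (hypothetical canned reply)."""
    return 'Result: {"invoice_id": "A-1001", "invoice_date": "2025-05-03"}'


schema = {"invoice_id": "", "invoice_date": ""}
result = extract(["INVOICE #A-1001", "Date: 2025-05-03"], schema, fake_generate)
```

Passing the model call in as a function keeps the parsing and key checks independent of the OCR tool and inference runtime, so either can be swapped without touching the rest.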

Good For

  • Automating data entry from invoices and other structured documents.
  • Developers needing a lightweight, locally deployable solution for custom schema-based information extraction.
  • Use cases requiring the conversion of unstructured text (from OCR) into structured JSON formats for further processing or database storage.
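For the database storage use case, one possible approach is to keep a few searchable columns alongside the full JSON payload. The sketch below uses Python's built-in `sqlite3` module; the table name `invoices`, its columns, and the helper `store_invoice` are hypothetical choices for illustration, not something the model card prescribes.

```python
import json
import sqlite3


def store_invoice(conn: sqlite3.Connection, invoice: dict) -> int:
    """Insert one extracted invoice and return its row id."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS invoices ("
        "id INTEGER PRIMARY KEY, invoice_id TEXT, invoice_date TEXT, payload TEXT)"
    )
    cur = conn.execute(
        "INSERT INTO invoices (invoice_id, invoice_date, payload) VALUES (?, ?, ?)",
        (
            invoice.get("invoice_id"),
            invoice.get("invoice_date"),
            json.dumps(invoice),  # keep the full extraction, including line items
        ),
    )
    conn.commit()
    return cur.lastrowid


# In-memory database so the example leaves no files behind.
conn = sqlite3.connect(":memory:")
row_id = store_invoice(
    conn,
    {
        "invoice_id": "A-1001",
        "invoice_date": "2025-05-03",
        "line_items": [{"description": "Widget", "quantity": "2", "amount": "19.98"}],
    },
)
stored = conn.execute(
    "SELECT invoice_id, payload FROM invoices WHERE id = ?", (row_id,)
).fetchone()
```

Storing the raw payload means fields added to the schema later are preserved even before the table gains matching columns.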