Overview
Nessie v5 is Arkova's specialized credential metadata extraction model, built on Meta Llama 3.1 8B Instruct. It is fine-tuned to extract structured data from PII-stripped document text and supports a context length of 32,768 tokens.
Key Capabilities
- Specialized Extraction: Designed to extract structured metadata from various credential types, including DEGREE, LICENSE, CERTIFICATE, FINANCIAL, LEGAL, and more.
- Domain-Specific Adapters: Includes LoRA adapters trained on SEC filings (45K examples), academic documents (45K examples), legal texts (13K examples), and regulatory documents (13K examples).
- Performance: Achieves a Weighted F1 score of 87.2% and a Macro F1 of 75.7% on its validation set for metadata extraction.
- PII-Stripped Processing: Intended for use with pre-processed text where personally identifiable information has been removed.
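The two F1 figures above average per-class scores differently, which is why they can diverge. The sketch below assumes the standard definitions (macro F1 is the unweighted mean of per-class F1; weighted F1 weights each class by its number of examples). The class scores and support counts are made-up illustrations, not Nessie v5's actual per-class results.

```python
# Illustrative only: per-class F1 and support values are hypothetical,
# not measurements from the Nessie v5 validation set.

def macro_f1(per_class_f1: dict[str, float]) -> float:
    """Unweighted mean of per-class F1: every class counts equally."""
    return sum(per_class_f1.values()) / len(per_class_f1)


def weighted_f1(per_class_f1: dict[str, float], support: dict[str, int]) -> float:
    """Mean of per-class F1 weighted by the number of examples per class."""
    total = sum(support.values())
    return sum(per_class_f1[c] * support[c] for c in per_class_f1) / total


per_class_f1 = {"DEGREE": 0.90, "LICENSE": 0.80, "LEGAL": 0.50}
support = {"DEGREE": 700, "LICENSE": 200, "LEGAL": 100}

print(f"macro F1:    {macro_f1(per_class_f1):.3f}")               # 0.733
print(f"weighted F1: {weighted_f1(per_class_f1, support):.3f}")   # 0.840
```

In this toy case the weighted score is higher because the largest class also scores best. A weighted F1 above macro F1 generally suggests that less frequent classes score lower than frequent ones.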
Good For
- Automated Credential Processing: Ideal for applications requiring the extraction of specific metadata fields from a wide range of credential documents.
- Legal and Financial Document Analysis: Best suited to SEC filings, legal documents, and academic records, which are domains covered by its specialized training.
- Structured Data Generation: Useful for converting unstructured credential text into structured, queryable data formats.
Important Note
This model must be used with the condensed prompt it was trained on (~1.5K characters). Using the full extraction prompt (58K characters) results in 0% F1 because of a prompt template mismatch.
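One way to guard against sending the wrong prompt by accident is a length check before inference. This is a minimal sketch: the function name `check_prompt` and the 4,000-character threshold are assumptions chosen for illustration (comfortably above ~1.5K and far below 58K), not part of the model's tooling.

```python
# Hypothetical guard: names and the threshold are illustrative assumptions.
MAX_CONDENSED_PROMPT_CHARS = 4_000  # well above ~1.5K, far below 58K


def check_prompt(prompt: str, limit: int = MAX_CONDENSED_PROMPT_CHARS) -> str:
    """Return the prompt unchanged, or raise if it is too long to be the condensed one."""
    if len(prompt) > limit:
        raise ValueError(
            f"Prompt is {len(prompt)} characters; expected the condensed prompt "
            f"(about 1.5K characters). The full extraction prompt is not supported."
        )
    return prompt


# Usage: wrap the system prompt before building the request.
condensed = check_prompt("x" * 1_500)  # passes
try:
    check_prompt("x" * 58_000)         # full-size prompt is rejected
except ValueError as err:
    print(f"rejected: {err}")
```

A length check only catches the gross mismatch described above. It does not verify that the text matches the trained template, so comparing against a stored copy or hash of the condensed prompt would be a stricter alternative.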