rishanthrajendhran/VeriFastScore

Text Generation | Concurrency Cost: 1 | Model Size: 8B | Quant: FP8 | Ctx Length: 32k | Published: May 19, 2025 | License: apache-2.0 | Architecture: Transformer | Open Weights

VeriFastScore by rishanthrajendhran is an 8-billion-parameter model, fine-tuned from LLaMA 3.1 Instruct, for evaluating the factuality of long-form LLM outputs. It extracts and verifies factual claims jointly in a single pass, making it faster and cheaper than pipeline-based evaluators such as VeriScore. It produces interpretable factuality scores by labeling each claim as Supported or Unsupported against the provided evidence. It is designed primarily for evaluation pipelines, dataset filtering, and system-level benchmarking of LLM factuality.


Overview

VeriFastScore is an 8-billion-parameter LLaMA 3.1 Instruct model, fine-tuned by NGRAM at UMD and Lambda Labs to evaluate the factuality of long-form, LLM-generated text. Unlike multi-step pipeline evaluators, it performs claim extraction and verification jointly in a single inference pass. This reduces latency and computational cost while maintaining high agreement with more expensive methods such as VeriScore.

Key Capabilities

  • Joint Claim Extraction and Verification: Extracts fine-grained, verifiable claims from long-form responses and simultaneously labels them as 'Supported' or 'Unsupported' based on provided evidence.
  • Reduced Latency and Cost: Designed to offer a faster and more economical solution for factuality assessment compared to traditional pipeline-based approaches.
  • High Correlation with Baselines: Achieves strong Pearson correlation (0.86 with claim-level evidence, 0.80 with sentence-level evidence) with VeriScore, a robust multi-step baseline.
  • System-Level Benchmarking: Reaches a system-level Pearson correlation of 0.94 with VeriScore on model rankings while running 6.6x faster (9.9x faster when retrieval time is excluded).
  • Input Flexibility: Takes a generated long-form response and a consolidated set of retrieved evidence sentences as input.

Good For

  • Factuality Evaluation Pipelines: Ideal for integrating into automated evaluation systems, such as those used for RLHF supervision.
  • Dataset Filtering: Can be used to filter datasets based on the factual accuracy of generated content.
  • LLM System Benchmarking: Suitable for benchmarking the factuality performance of various large language models at a system level.
  • Cost-Sensitive Applications: Provides a more efficient alternative for large-scale factuality assessment where cost and speed are critical factors.
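
Two of the uses above, dataset filtering and system-level benchmarking, reduce to simple operations on per-response scores. The sketch below assumes each record is a dictionary with a `score` key holding a factuality score between 0 and 1 (or `None` when no claims were extracted). The record layout, the threshold, and the function names are hypothetical. The `pearson` helper is the standard sample Pearson correlation, which could be used to compare per-system average scores from two evaluators.

```python
import math


def filter_by_score(records: list[dict], threshold: float) -> list[dict]:
    """Keep records whose factuality score is at least `threshold`.

    Records with a score of None (no verifiable claims) are dropped.
    """
    return [
        r for r in records
        if r.get("score") is not None and r["score"] >= threshold
    ]


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)
```

For benchmarking, `xs` would hold one average score per system from one evaluator and `ys` the matching averages from the other, in the same system order.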