verbal-calibrate: Confidence-Aware QA for Adaptive RAG

verbal-calibrate is an 8-billion-parameter model fine-tuned from meta-llama/Llama-3.1-8B-Instruct. Alongside each answer to a factual question, it states a calibrated verbal confidence score.

Key Capabilities & Features

Verbalized Confidence: Provides a decimal confidence score (0-1) with each answer, reflecting the model's uncertainty.
Adaptive Retrieval Gating: Designed for adaptive RAG, where a low confidence score (e.g., < 0.5) can trigger external retrieval (like BM25) for a second-pass generation.
Step-by-Step Reasoning: Answers factual questions by first reasoning through the problem before stating the final answer and confidence.
Targeted Training: Supervised fine-tuning on multi-hop QA datasets (HotpotQA, MuSiQue, 2WikiMultiHopQA) and open-domain QA datasets (Natural Questions, TriviaQA), followed by calibration to align expressed confidence with empirical accuracy.
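
The gating behaviour described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes the model writes its score in a form like "Confidence: 0.35" (the actual output format is not stated here), and `generate` and `retrieve` are hypothetical callables standing in for the model call and a retriever such as BM25.

```python
import re
from typing import Callable, List, Optional, Tuple

# ASSUMPTION: the response contains text such as "Confidence: 0.35".
# The real output format may differ; adjust the pattern accordingly.
CONFIDENCE_PATTERN = re.compile(
    r"confidence\s*[:=]\s*([01](?:\.\d+)?)(?!\d)", re.IGNORECASE
)


def parse_confidence(text: str) -> Optional[float]:
    """Return the last confidence value in [0, 1] found in text, or None."""
    matches = CONFIDENCE_PATTERN.findall(text)
    if not matches:
        return None
    value = float(matches[-1])
    return value if 0.0 <= value <= 1.0 else None


def answer_with_gating(
    question: str,
    generate: Callable[[str, Optional[List[str]]], str],  # hypothetical model call
    retrieve: Callable[[str], List[str]],                 # hypothetical retriever
    threshold: float = 0.5,
) -> Tuple[str, bool]:
    """Return (response, retrieval_triggered) for one question."""
    # First pass: answer from parametric knowledge only.
    first_pass = generate(question, None)
    confidence = parse_confidence(first_pass)

    # A missing or unparsable score is treated as low confidence.
    if confidence is not None and confidence >= threshold:
        return first_pass, False

    # Second pass: answer again with retrieved passages in context.
    passages = retrieve(question)
    return generate(question, passages), True
```

Treating an unparsable score as low confidence is a design choice in this sketch: it errs toward retrieval rather than returning an answer of unknown reliability.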

Performance Highlights

Results vary widely by dataset, as does the rate at which the model's confidence triggers retrieval. On TriviaQA, it reaches an exact-match (EM) score of 53.2 and an F1 of 62.5 with a 28.8% trigger rate; on MuSiQue, it reaches an EM of 11.8 and an F1 of 18.8 with a 76.8% trigger rate. The model thus requests retrieval far more often on the dataset where its answers are less accurate.
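
For readers who want to compute comparable numbers on their own data, the sketch below shows one common way to define these metrics. It assumes SQuAD-style answer normalization for EM and token-level F1, and it assumes the trigger rate is the share of questions whose confidence falls below the gating threshold; the evaluation protocol actually used for the figures above is not specified here.

```python
import re
import string
from collections import Counter
from typing import Sequence


def normalize(text: str) -> str:
    """Lowercase, strip punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def trigger_rate(confidences: Sequence[float], threshold: float = 0.5) -> float:
    """Percentage of questions whose confidence is below the threshold (assumed definition)."""
    if not confidences:
        return 0.0
    return 100.0 * sum(c < threshold for c in confidences) / len(confidences)
```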

Ideal Use Cases

Adaptive RAG Pipelines: Dynamically decide when to perform retrieval based on the model's self-assessed confidence.
Confidence-Aware Factual QA: Applications requiring not just an answer, but also an indication of the answer's reliability.
Uncertainty Calibration Research: A testbed for studying and improving uncertainty quantification in LLMs.
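
One way to check how well stated confidence matches empirical accuracy is expected calibration error (ECE). The sketch below uses equal-width bins; it is a generic illustration, and the calibration metric used to train or evaluate this model is not stated here. The function name and the default of 10 bins are choices made for this example.

```python
from typing import List, Sequence, Tuple


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Weighted average gap between mean confidence and accuracy per bin."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    total = len(confidences)
    if total == 0:
        return 0.0

    # Assign each prediction to an equal-width bin over [0, 1].
    bins: List[List[Tuple[float, float]]] = [[] for _ in range(n_bins)]
    for conf, is_correct in zip(confidences, correct):
        index = max(0, min(int(conf * n_bins), n_bins - 1))
        bins[index].append((conf, float(is_correct)))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(a for _, a in members) / len(members)
        ece += len(members) / total * abs(mean_conf - accuracy)
    return ece
```

A value near 0 means that answers given with confidence around 0.8, for example, are correct about 80% of the time.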
