CoSineVerifier-Tool-4B: Tool-Augmented Answer Verification

CoSineVerifier-Tool-4B is a compact, 4-billion-parameter model designed to verify answers to computation-oriented scientific questions. Unlike verifiers that rely on model reasoning alone, it augments its reasoning with external tools to ensure precision in complex calculations and unit conversions.

Key Capabilities

Tool-Augmented Verification: Integrates a Python interpreter for validating algebraic steps, algorithmic logic, and data operations, and a Scientific Unit Converter for normalizing and verifying unit conversions (e.g., km/h to m/s).
High Accuracy in Scientific Scenarios: Evaluates multi-step reasoning in math, physics, chemistry, biology, and logical reasoning, where correctness depends on precise intermediate calculations.
Concise and Low-Latency: Designed for real-time evaluation and large-scale batch processing with ≤100 output tokens per verdict.
Broad Applicability: Supports short-answer and multiple-choice formats, handling both brief and long-form responses across various scientific domains.
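
To make the unit-normalization idea concrete, the following is a minimal sketch of the kind of check a unit-conversion tool could perform when comparing a candidate answer with a reference answer. It is not the model's actual tool: the names `TO_METERS_PER_SECOND`, `to_si_speed`, and `answers_match`, the set of supported units, and the relative tolerance are all assumptions made for illustration.

```python
import math

# Hypothetical conversion table (an assumption, not the model's real tool):
# multiplying a speed in the given unit by the factor yields metres per second.
TO_METERS_PER_SECOND = {
    "m/s": 1.0,
    "km/h": 1000.0 / 3600.0,   # 1 km = 1000 m, 1 h = 3600 s
    "mph": 1609.344 / 3600.0,  # 1 international mile = 1609.344 m
}


def to_si_speed(value: float, unit: str) -> float:
    """Convert a speed expressed in `unit` to metres per second."""
    return value * TO_METERS_PER_SECOND[unit]


def answers_match(ref_value: float, ref_unit: str,
                  cand_value: float, cand_unit: str,
                  rel_tol: float = 1e-3) -> bool:
    """Return True if both speeds agree after normalization to m/s.

    The relative tolerance is an illustrative choice, not a documented setting.
    """
    return math.isclose(
        to_si_speed(ref_value, ref_unit),
        to_si_speed(cand_value, cand_unit),
        rel_tol=rel_tol,
    )


# A reference of 72 km/h and a candidate of 20 m/s describe the same speed.
print(answers_match(72.0, "km/h", 20.0, "m/s"))
# A candidate of 25 m/s does not match the same reference.
print(answers_match(72.0, "km/h", 25.0, "m/s"))
```

Normalizing both values to a common unit before comparing them means that equivalent answers written in different units are not rejected because their surface forms differ.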

Performance Highlights

CoSineVerifier-Tool-4B achieves state-of-the-art results on key benchmarks:

VerifyBench: 96.6% accuracy (91.9% on Hard subset).
SCI-VerifyBench: 89.7% accuracy.
Efficiency: Averages 95.3 output tokens per verdict, significantly fewer than other chain-of-thought (CoT) verifiers.

In experiments on competition-math problems, it also yields clear improvements over other verification methods when used for reinforcement learning with verifiable rewards (RLVR).

Limitations

Currently, CoSineVerifier-Tool-4B invokes external tools in approximately 10% of cases and may still struggle with the most challenging verification items. Future improvements aim to increase tool-use coverage and develop a unified external-tool suite.
