CoSineVerifier-Tool-4B: Tool-Augmented Answer Verification
CoSineVerifier-Tool-4B is a compact, 4-billion-parameter model designed to verify answers to computation-oriented scientific questions. Unlike traditional verifiers, it augments its reasoning with external tools to ensure precision in complex calculations and unit conversions.
Key Capabilities
- Tool-Augmented Verification: Integrates a Python interpreter for validating algebraic steps, algorithmic logic, and data operations, and a Scientific Unit Converter for normalizing and verifying unit conversions (e.g., km/h to m/s).
- High Accuracy in Scientific Scenarios: Evaluates multi-step reasoning in math, physics, chemistry, biology, and logical reasoning, where correctness depends on precise intermediate calculations.
- Concise and Low-Latency: Designed for real-time evaluation and large-scale batch processing with ≤100 output tokens per verdict.
- Broad Applicability: Supports short-answer and multiple-choice formats, handling both brief and long-form responses across various scientific domains.
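To make the tool-augmented check concrete, here is a minimal sketch of the kind of unit-normalizing comparison that the km/h to m/s example implies. It is an illustration under stated assumptions, not the model's actual tool implementation: the conversion table, the function names `to_m_per_s` and `answers_match`, and the relative tolerance are all hypothetical.

```python
import math

# Hypothetical conversion table (assumption): factors that map a speed
# unit to metres per second. The real unit converter is not documented here.
SPEED_TO_M_PER_S = {
    "m/s": 1.0,
    "km/h": 1000.0 / 3600.0,
    "mph": 1609.344 / 3600.0,
}


def to_m_per_s(value: float, unit: str) -> float:
    """Normalize a speed to metres per second."""
    return value * SPEED_TO_M_PER_S[unit]


def answers_match(ref_value: float, ref_unit: str,
                  cand_value: float, cand_unit: str,
                  rel_tol: float = 1e-3) -> bool:
    """Compare two quantities after normalizing both to the same unit.

    rel_tol is an assumed tolerance, chosen only for illustration.
    """
    ref = to_m_per_s(ref_value, ref_unit)
    cand = to_m_per_s(cand_value, cand_unit)
    return math.isclose(ref, cand, rel_tol=rel_tol)


# A reference of 72 km/h and a candidate of 20 m/s describe the same speed.
same_speed = answers_match(72, "km/h", 20, "m/s")
# The same number with a different unit is a different quantity.
unit_mismatch = answers_match(72, "km/h", 72, "m/s")
```

Normalizing both quantities before comparing them is what lets a verifier accept a correct answer expressed in a different unit while still rejecting an answer that only matches numerically.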
Performance Highlights
CoSineVerifier-Tool-4B achieves state-of-the-art results on key benchmarks:
- VerifyBench: 96.6% accuracy (91.9% on Hard subset).
- SCI-VerifyBench: 89.7% accuracy.
- Efficiency: Averages 95.3 output tokens per verdict, significantly fewer than other chain-of-thought (CoT) verifiers.
It also demonstrates clear improvements in RLVR tasks compared to other verification methods, as shown in experiments on competition-math problems.
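In RLVR (reinforcement learning with verifiable rewards), a verifier's verdict usually serves as the reward signal for each sampled response. The sketch below shows one way that mapping could look. The verdict label "correct", the `verify` callable, and the stub verifier are assumptions made for illustration; the model's actual output format and inference interface are not specified here.

```python
from typing import Callable, List


def verdict_to_reward(verdict: str) -> float:
    """Map a textual verdict to a binary reward.

    The label "correct" is an assumed verdict string, not a documented one.
    """
    return 1.0 if verdict.strip().lower() == "correct" else 0.0


def score_rollouts(question: str, reference: str, responses: List[str],
                   verify: Callable[[str, str, str], str]) -> List[float]:
    """Score each sampled response with a verifier callable.

    `verify` is a hypothetical wrapper around a verifier model that takes
    (question, reference, response) and returns a verdict string.
    """
    return [verdict_to_reward(verify(question, reference, r)) for r in responses]


def stub_verify(question: str, reference: str, response: str) -> str:
    """Stand-in verifier for this example: exact match after trimming."""
    return "correct" if response.strip() == reference.strip() else "incorrect"


rewards = score_rollouts("What is 6 * 7?", "42", ["42", "41", " 42 "], stub_verify)
```

In practice the stub would be replaced by a call to the verifier model, and the resulting rewards would feed the policy update.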
Limitations
Currently, CoSineVerifier-Tool-4B invokes external tools in approximately 10% of cases and may still struggle with the most challenging verification items. Future improvements aim to increase tool-use coverage and develop a unified external-tool suite.