Nanbeige/CoSineVerifier-Tool-4B

Text generation | Concurrency cost: 1 | Model size: 4B | Quant: BF16 | Context length: 32k | Published: Oct 22, 2025 | License: MIT | Architecture: Transformer | Open weights

Nanbeige/CoSineVerifier-Tool-4B is a 4-billion-parameter, tool-augmented verifier for checking answers to computation-oriented scientific questions. It integrates a Python interpreter and a scientific unit converter to evaluate multi-step reasoning in math and science, normalizing answers so that numeric checks are consistent. The model targets algebraic-equivalence and physical-constant checks with high accuracy and low latency, and reports state-of-the-art results on VerifyBench and SCI-VerifyBench.

CoSineVerifier-Tool-4B: Tool-Augmented Answer Verification

CoSineVerifier-Tool-4B is a compact, 4-billion-parameter model built to verify answers to computation-oriented scientific questions. Unlike verifiers that rely on model reasoning alone, it calls external tools to keep complex calculations and unit conversions precise.

Key Capabilities

  • Tool-Augmented Verification: Integrates a Python interpreter for validating algebraic steps, algorithmic logic, and data operations, and a Scientific Unit Converter for normalizing and verifying unit conversions (e.g., km/h to m/s).
  • High Accuracy in Scientific Scenarios: Evaluates multi-step reasoning in math, physics, chemistry, biology, and logic problems, where correctness depends on precise intermediate calculations.
  • Concise and Low-Latency: Designed for real-time evaluation and large-scale batch processing with ≤100 output tokens per verdict.
  • Broad Applicability: Supports short-answer and multiple-choice formats, handling both brief and long-form responses across various scientific domains.
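The two kinds of tool check described above can be illustrated with a minimal Python sketch. It is not the model's actual tooling: the conversion table `TO_SI`, the helper names `to_si`, `answers_match`, and `numerically_equivalent`, and the tolerances are all assumptions made for illustration. The first part normalizes a candidate and a reference answer to SI units before comparing them; the second approximates an algebraic-equivalence check by evaluating two expressions at random sample points.

```python
import math
import random

# Hypothetical conversion table: unit -> (dimension, factor to the SI unit).
# Illustrative only; the model's real unit converter is not documented here.
TO_SI = {
    "m/s": ("speed", 1.0),
    "km/h": ("speed", 1000.0 / 3600.0),
    "m": ("length", 1.0),
    "km": ("length", 1000.0),
}

def to_si(value, unit):
    """Return (dimension, value expressed in the SI unit)."""
    dimension, factor = TO_SI[unit]
    return dimension, value * factor

def answers_match(candidate, reference, rel_tol=1e-6):
    """Compare two (value, unit) answers after normalizing both to SI."""
    cand_dim, cand_val = to_si(*candidate)
    ref_dim, ref_val = to_si(*reference)
    if cand_dim != ref_dim:
        # A length can never equal a speed, whatever the numbers are.
        return False
    return math.isclose(cand_val, ref_val, rel_tol=rel_tol)

def numerically_equivalent(f, g, trials=20, seed=0):
    """Heuristic equivalence test: compare f and g at random points.

    Agreement at sampled points is evidence of equivalence, not a proof.
    """
    rng = random.Random(seed)
    for _ in range(trials):
        x = rng.uniform(-10.0, 10.0)
        if not math.isclose(f(x), g(x), rel_tol=1e-9, abs_tol=1e-9):
            return False
    return True

if __name__ == "__main__":
    print(answers_match((72.0, "km/h"), (20.0, "m/s")))   # True
    print(answers_match((72.0, "km/h"), (72.0, "m/s")))   # False
    print(numerically_equivalent(lambda x: (x + 1) ** 2,
                                 lambda x: x * x + 2 * x + 1))  # True
```

Random sampling is used here only to keep the sketch dependency-free; a symbolic algebra library would give a stronger equivalence check.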

Performance Highlights

CoSineVerifier-Tool-4B achieves state-of-the-art results on key benchmarks:

  • VerifyBench: 96.6% accuracy (91.9% on Hard subset).
  • SCI-VerifyBench: 89.7% accuracy.
  • Efficiency: Averages 95.3 output tokens per verdict, significantly fewer than other chain-of-thought (CoT) verifiers.
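For readers who want to reproduce metrics of this kind on their own data, the following sketch shows how verdict accuracy and mean output tokens per verdict are typically computed. The record layout (`predicted`, `gold`, `output_tokens`) and the toy values are hypothetical; they are not taken from VerifyBench or SCI-VerifyBench.

```python
from statistics import mean

# Hypothetical toy records: one entry per verified answer.
# "predicted" is the verifier's verdict, "gold" the human label.
records = [
    {"predicted": True, "gold": True, "output_tokens": 60},
    {"predicted": False, "gold": False, "output_tokens": 70},
    {"predicted": True, "gold": False, "output_tokens": 80},
    {"predicted": False, "gold": False, "output_tokens": 90},
]

def accuracy(rows):
    """Fraction of verdicts that agree with the gold label."""
    return sum(r["predicted"] == r["gold"] for r in rows) / len(rows)

def mean_output_tokens(rows):
    """Average number of tokens the verifier emitted per verdict."""
    return mean(r["output_tokens"] for r in rows)

if __name__ == "__main__":
    print(f"accuracy: {accuracy(records):.2%}")                   # accuracy: 75.00%
    print(f"mean output tokens: {mean_output_tokens(records)}")   # mean output tokens: 75
```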

It also shows clear improvements on reinforcement learning with verifiable rewards (RLVR) tasks compared with other verification methods, as shown in experiments on competition-math problems.

Limitations

Currently, CoSineVerifier-Tool-4B invokes external tools in approximately 10% of cases and may still struggle with the most challenging verification items. Future improvements aim to increase tool-use coverage and develop a unified external-tool suite.