KE-Team/KE-SemanticVAD

Text generation · 0.5B parameters · BF16 · 32k context length · apache-2.0 · Transformer

KE-Team/KE-SemanticVAD is a semantic-level Voice Activity Detection (VAD) model fine-tuned from the Qwen2.5-0.5B-Instruct architecture. Designed for full-duplex voice dialogue systems, it analyzes human intent during conversations, classifying actions such as interruptions or affirmations when an agent is speaking, and determining utterance completion when a human is speaking. The model demonstrates high accuracy across these semantic VAD tasks, making it suitable for real-time human-computer interaction analysis.


KE-SemanticVAD: Semantic Voice Activity Detection for Full-Duplex Dialogue

KE-SemanticVAD is a specialized model, fine-tuned from the Qwen2.5-0.5B-Instruct architecture, designed for semantic-level Voice Activity Detection (VAD) within full-duplex voice dialogue systems. Unlike traditional VAD, which merely detects the presence of speech, this model interprets the intent behind human utterances in real time.

Key Capabilities

  • Agent-Speaking Analysis: When an agent is speaking, the model analyzes human input to classify it as either an <打断> (interruption) or <附和> (affirmation).
    • Interruption (<打断>): Detected when a human attempts to seize conversational control, introduces new concepts, asks questions, raises objections, or shifts topics.
    • Affirmation (<附和>): Identified when a human agrees with the agent, uses minimal feedback (e.g., "uh-huh"), repeats the agent's words, or expresses simple confirmation.
  • Human-Speaking Analysis: When a human is speaking, the model determines if their utterance is <完成> (complete) or <未完> (incomplete).
    • Complete (<完成>): Indicated by semantically full statements, clear questions/requests/conclusions, or explicit ending markers.
    • Incomplete (<未完>): Suggested by trailing conjunctions (e.g., "and also," "but") or filler words (e.g., "uhm," "hmm").
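Since the model emits one of four labels, inference reduces to prompting and then scanning the completion for a label token. The sketch below illustrates this pattern; the prompt layout (`Agent: … / Human: …`) and the decision to keep special tokens when decoding are assumptions for illustration, not the repository's documented usage, so check the model card's examples for the exact chat format.

```python
# Hypothetical inference sketch for KE-SemanticVAD. The prompt template and
# label parsing are assumptions for illustration; the repository may specify
# a different chat format.
from typing import Optional

# The four semantic-VAD labels described above.
LABELS = ["<打断>", "<附和>", "<完成>", "<未完>"]


def parse_label(generated_text: str) -> Optional[str]:
    """Return the first semantic-VAD label found in the model output."""
    for label in LABELS:
        if label in generated_text:
            return label
    return None


def classify(model, tokenizer, agent_text: str, human_text: str) -> Optional[str]:
    """Run one classification turn using an assumed chat-style prompt."""
    messages = [
        {"role": "user", "content": f"Agent: {agent_text}\nHuman: {human_text}"},
    ]
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    )
    output_ids = model.generate(input_ids, max_new_tokens=8)
    # Keep special tokens in case the labels are registered as such.
    completion = tokenizer.decode(
        output_ids[0][input_ids.shape[-1]:], skip_special_tokens=False
    )
    return parse_label(completion)
```

The model and tokenizer would be loaded once with `AutoModelForCausalLM.from_pretrained("KE-Team/KE-SemanticVAD")` and `AutoTokenizer.from_pretrained("KE-Team/KE-SemanticVAD")`, then passed into `classify` on each turn; keeping `parse_label` separate makes the label-extraction logic testable without loading weights.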

Performance Highlights

The model exhibits strong performance on its test set for semantic classification:

  • <打断>: 98.07% accuracy
  • <附和>: 98.12% accuracy
  • <完成>: 92.73% accuracy
  • <未完>: 99.91% accuracy

Good For

  • Real-time human-computer interaction systems: Enhancing the naturalness and responsiveness of conversational AI.
  • Dialogue management: Providing crucial signals for turn-taking and understanding user intent beyond simple speech detection.
  • Voice assistants and chatbots: Enabling more sophisticated and context-aware responses based on user behavior during dialogue.
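As a sketch of how these signals might feed a dialogue manager, the following maps each label to a turn-taking action depending on who currently holds the floor. The action names are illustrative assumptions, not part of the model:

```python
# Hypothetical turn-taking glue for a full-duplex dialogue system.
# Action names ("stop_speaking", "respond", ...) are illustrative.

def next_action(agent_speaking: bool, label: str) -> str:
    """Decide the system's next move from a KE-SemanticVAD label."""
    if agent_speaking:
        if label == "<打断>":   # interruption: yield the floor to the human
            return "stop_speaking"
        if label == "<附和>":   # affirmation: keep talking
            return "continue_speaking"
    else:
        if label == "<完成>":   # utterance complete: respond now
            return "respond"
        if label == "<未完>":   # utterance incomplete: keep listening
            return "wait"
    return "noop"              # label not applicable to this floor state
```

In a real system this decision would also weigh acoustic VAD and latency budgets, but the core value of semantic VAD is exactly this: turning "speech detected" into "the user wants the floor" versus "the user is just backchanneling."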