KE-SemanticVAD: Semantic Voice Activity Detection for Full-Duplex Dialogue
KE-SemanticVAD is a specialized model, fine-tuned from the Qwen2.5-0.5B-Instruct architecture, designed for semantic-level Voice Activity Detection (VAD) in full-duplex voice dialogue systems. Unlike traditional VAD, which merely detects the presence of speech, this model interprets the intent behind human utterances in real time.
Key Capabilities
- Agent-Speaking Analysis: When the agent is speaking, the model classifies concurrent human input as either an <打断> (interruption) or an <附和> (affirmation).
  - Interruption (<打断>): detected when the human attempts to seize conversational control, introduces new concepts, asks questions, raises objections, or shifts topics.
  - Affirmation (<附和>): identified when the human agrees with the agent, gives minimal feedback (e.g., "uh-huh"), repeats the agent's words, or expresses simple confirmation.
- Human-Speaking Analysis: When the human is speaking, the model determines whether the utterance is <完成> (complete) or <未完> (incomplete).
  - Complete (<完成>): indicated by a semantically full statement, a clear question, request, or conclusion, or an explicit ending marker.
  - Incomplete (<未完>): suggested by trailing conjunctions (e.g., "and also," "but") or filler words (e.g., "uhm," "hmm").
Performance Highlights
The model exhibits strong performance on its test set for semantic classification:
- <打断> (interruption): 98.07% accuracy
- <附和> (affirmation): 98.12% accuracy
- <完成> (complete): 92.73% accuracy
- <未完> (incomplete): 99.91% accuracy
Good For
- Real-time human-computer interaction systems: Enhancing the naturalness and responsiveness of conversational AI.
- Dialogue management: Providing crucial signals for turn-taking and understanding user intent beyond simple speech detection.
- Voice assistants and chatbots: Enabling more sophisticated and context-aware responses based on user behavior during dialogue.
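Since the model is an instruct-tuned LLM rather than an acoustic VAD, input would likely be framed as a chat-style classification prompt. The sketch below builds such a prompt; the system-instruction wording and message format are assumptions, as this card does not document the fine-tuning prompt template.

```python
# Assumed prompt construction for KE-SemanticVAD; the exact instruction
# wording used during fine-tuning is not documented in the model card.

def build_messages(agent_is_speaking: bool, human_text: str) -> list[dict]:
    """Wrap the current dialogue state into a chat-format classification request."""
    if agent_is_speaking:
        task = "The agent is speaking. Classify the human input as <打断> or <附和>."
    else:
        task = "The human is speaking. Classify the utterance as <完成> or <未完>."
    return [
        {"role": "system", "content": task},
        {"role": "user", "content": human_text},
    ]

# These messages could then be rendered with tokenizer.apply_chat_template(...)
# and decoded with the fine-tuned Qwen2.5-0.5B-Instruct checkpoint, taking the
# generated label token as the classification result.
```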