KE-Team/KE-SemanticVAD
KE-SemanticVAD: Semantic Voice Activity Detection for Full-Duplex Dialogue
KE-SemanticVAD is a specialized model, fine-tuned from the Qwen2.5-0.5B-Instruct architecture, designed for semantic-level Voice Activity Detection (VAD) within full-duplex voice dialogue systems. Unlike traditional VAD that merely detects speech presence, this model focuses on interpreting the intent behind human utterances in real-time.
Key Capabilities
- Agent-Speaking Analysis: When an agent is speaking, the model analyzes human input and classifies it as either an interruption (`<打断>`) or an affirmation (`<附和>`).
  - Interruption (`<打断>`): Detected when a human attempts to seize conversational control, introduces new concepts, asks questions, raises objections, or shifts topics.
  - Affirmation (`<附和>`): Identified when a human agrees with the agent, gives minimal feedback (e.g., "uh-huh"), repeats the agent's words, or expresses simple confirmation.
- Human-Speaking Analysis: When a human is speaking, the model determines whether the utterance is complete (`<完成>`) or incomplete (`<未完>`).
  - Complete (`<完成>`): Indicated by semantically full statements; clear questions, requests, or conclusions; or explicit ending markers.
  - Incomplete (`<未完>`): Suggested by trailing conjunctions (e.g., "and also," "but") or filler words (e.g., "uhm," "hmm").
Performance Highlights
The model exhibits strong performance on its test set for semantic classification:
| Label | Meaning | Accuracy |
| --- | --- | --- |
| `<打断>` | interruption | 98.07% |
| `<附和>` | affirmation | 98.12% |
| `<完成>` | complete | 92.73% |
| `<未完>` | incomplete | 99.91% |
Good For
- Real-time human-computer interaction systems: Enhancing the naturalness and responsiveness of conversational AI.
- Dialogue management: Providing crucial signals for turn-taking and understanding user intent beyond simple speech detection.
- Voice assistants and chatbots: Enabling more sophisticated and context-aware responses based on user behavior during dialogue.