KE-Team/KE-SemanticVAD

Text generation · 0.5B parameters · BF16 · 32k context length · apache-2.0 · Transformer

KE-Team/KE-SemanticVAD is a semantic-level Voice Activity Detection (VAD) model fine-tuned from the Qwen2.5-0.5B-Instruct architecture. Designed for full-duplex voice dialogue systems, it analyzes human intent during conversations, classifying actions such as interruptions or affirmations when an agent is speaking, and determining utterance completion when a human is speaking. The model demonstrates high accuracy across these semantic VAD tasks, making it suitable for real-time human-computer interaction analysis.


KE-SemanticVAD: Semantic Voice Activity Detection for Full-Duplex Dialogue

KE-SemanticVAD is a specialized model, fine-tuned from the Qwen2.5-0.5B-Instruct architecture, designed for semantic-level Voice Activity Detection (VAD) within full-duplex voice dialogue systems. Unlike traditional VAD, which merely detects the presence of speech, this model interprets the intent behind human utterances in real time.

Key Capabilities

  • Agent-Speaking Analysis: When an agent is speaking, the model analyzes human input to classify it as either an <打断> (interruption) or <附和> (affirmation).
    • Interruption (<打断>): Detected when a human attempts to seize conversational control, introduces new concepts, asks questions, raises objections, or shifts topics.
    • Affirmation (<附和>): Identified when a human agrees with the agent, uses minimal feedback (e.g., "uh-huh"), repeats the agent's words, or expresses simple confirmation.
  • Human-Speaking Analysis: When a human is speaking, the model determines if their utterance is <完成> (complete) or <未完> (incomplete).
    • Complete (<完成>): Indicated by semantically full statements, clear questions/requests/conclusions, or explicit ending markers.
    • Incomplete (<未完>): Suggested by trailing conjunctions (e.g., "and also," "but") or filler words (e.g., "uhm," "hmm").
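Since the model emits one of four labels, inference reduces to prompting and then scanning the completion for a label token. The sketch below illustrates this pattern; the prompt layout (`Agent: … / Human: …`) and the decision to keep special tokens when decoding are assumptions for illustration, not the repository's documented usage, so check the model card's examples for the exact chat format.

```python
# Hypothetical inference sketch for KE-SemanticVAD. The prompt template and
# label parsing are assumptions for illustration; the repository may specify
# a different chat format.
from typing import Optional

# The four semantic-VAD labels described above.
LABELS = ["<打断>", "<附和>", "<完成>", "<未完>"]


def parse_label(generated_text: str) -> Optional[str]:
    """Return the first semantic-VAD label found in the model output."""
    for label in LABELS:
        if label in generated_text:
            return label
    return None


def classify(model, tokenizer, agent_text: str, human_text: str) -> Optional[str]:
    """Run one classification turn using an assumed chat-style prompt."""
    messages = [
        {"role": "user", "content": f"Agent: {agent_text}\nHuman: {human_text}"},
    ]
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    )
    output_ids = model.generate(input_ids, max_new_tokens=8)
    # Keep special tokens in case the labels are registered as such.
    completion = tokenizer.decode(
        output_ids[0][input_ids.shape[-1]:], skip_special_tokens=False
    )
    return parse_label(completion)
```

The model and tokenizer would be loaded once with `AutoModelForCausalLM.from_pretrained("KE-Team/KE-SemanticVAD")` and `AutoTokenizer.from_pretrained("KE-Team/KE-SemanticVAD")`, then passed into `classify` on each turn; keeping `parse_label` separate makes the label-extraction logic testable without loading weights.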

Performance Highlights

The model exhibits strong performance on its test set for semantic classification:

  • <打断>: 98.07% accuracy
  • <附和>: 98.12% accuracy
  • <完成>: 92.73% accuracy
  • <未完>: 99.91% accuracy

Good For

  • Real-time human-computer interaction systems: Enhancing the naturalness and responsiveness of conversational AI.
  • Dialogue management: Providing crucial signals for turn-taking and understanding user intent beyond simple speech detection.
  • Voice assistants and chatbots: Enabling more sophisticated and context-aware responses based on user behavior during dialogue.
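As a sketch of how these signals might feed a dialogue manager, the following maps each label to a turn-taking action depending on who currently holds the floor. The action names are illustrative assumptions, not part of the model:

```python
# Hypothetical turn-taking glue for a full-duplex dialogue system.
# Action names ("stop_speaking", "respond", ...) are illustrative.

def next_action(agent_speaking: bool, label: str) -> str:
    """Decide the system's next move from a KE-SemanticVAD label."""
    if agent_speaking:
        if label == "<打断>":   # interruption: yield the floor to the human
            return "stop_speaking"
        if label == "<附和>":   # affirmation: keep talking
            return "continue_speaking"
    else:
        if label == "<完成>":   # utterance complete: respond now
            return "respond"
        if label == "<未完>":   # utterance incomplete: keep listening
            return "wait"
    return "noop"              # label not applicable to this floor state
```

In a real system this decision would also weigh acoustic VAD and latency budgets, but the core value of semantic VAD is exactly this: turning "speech detected" into "the user wants the floor" versus "the user is just backchanneling."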