anyreach-ai/semantic-turn-taking
The anyreach-ai/semantic-turn-taking model, developed by Shangeth Rajaa, is a fine-tuned Qwen2.5-0.5B-Instruct model (494M parameters) that predicts turn-taking actions in conversational AI. Unlike acoustic methods, it uses the semantic content of the conversation to decide when a voice agent should speak, listen, or continue. The model predicts one of four actions: start_speaking, continue_listening, start_listening, or continue_speaking, which makes it suited to building responsive, natural-sounding voice AI agents.
Semantic Turn-Taking Model Overview
The anyreach-ai/semantic-turn-taking model is a specialized language model, fine-tuned from Qwen2.5-0.5B-Instruct, designed to predict optimal turn-taking actions for voice AI agents in real-time conversations. Its core innovation lies in using the semantic content of the dialogue, rather than just acoustic cues like silence detection, to make these predictions.
Key Capabilities
- Semantic-based Turn Prediction: Determines agent actions based on the meaning and flow of the conversation.
- Four Action Classes: Predicts one of four distinct actions:
  - start_speaking: User has finished, agent should respond.
  - continue_listening: User is still speaking.
  - start_listening: User interrupted the agent, agent should stop talking.
  - continue_speaking: User provided a backchannel, agent should continue speaking.
- Efficient Inference: Single-example latency of 26-34 ms on GPU and 128-191 ms on CPU (ONNX q8).
- Benchmarked Performance: Achieves up to 91.82% accuracy on binary end-of-utterance (EOU vs. Not-EOU) turn-taking prediction on the TEN dataset.
- Flexible Deployment: Available in PyTorch (fp16/fp32) and ONNX (q8 quantized) formats.
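The four action classes listed above can be read as transitions between two agent states. The following is a minimal sketch of that reading, not the model's own API: the names TurnAction, AgentState, parse_action, and next_state are hypothetical, and the sketch assumes the model's output has already been decoded to one of the four label strings.

```python
from enum import Enum


class TurnAction(str, Enum):
    """The four labels named in the model card."""
    START_SPEAKING = "start_speaking"
    CONTINUE_LISTENING = "continue_listening"
    START_LISTENING = "start_listening"
    CONTINUE_SPEAKING = "continue_speaking"


class AgentState(str, Enum):
    """Hypothetical two-state view of the voice agent."""
    LISTENING = "listening"
    SPEAKING = "speaking"


def parse_action(label: str) -> TurnAction:
    # Raises ValueError if the label is not one of the four known actions.
    return TurnAction(label.strip())


def next_state(action: TurnAction) -> AgentState:
    # start_speaking: the user finished, so the agent takes the turn.
    # continue_speaking: the user only backchanneled, so the agent keeps the turn.
    if action in (TurnAction.START_SPEAKING, TurnAction.CONTINUE_SPEAKING):
        return AgentState.SPEAKING
    # continue_listening: the user is still talking.
    # start_listening: the user interrupted, so the agent yields the turn.
    return AgentState.LISTENING
```

In this reading, start_listening and continue_listening both end in the listening state; the difference is that start_listening also implies stopping any speech output already in progress.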
Good For
- Voice AI Agents: Enhancing the naturalness and responsiveness of conversational AI systems.
- Real-time Interaction: Applications requiring precise, context-aware turn-taking decisions in live dialogue.
- Dialogue Management: Integrating semantic understanding into the flow control of spoken interactions.
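For real-time use, it may be worth checking per-example latency on the target hardware before integrating the model. The sketch below is one way to do that under stated assumptions: measure_latency_ms and stub_predict are hypothetical names, and stub_predict is a punctuation-based placeholder standing in for the real model call, not the model's actual behavior.

```python
import statistics
import time
from typing import Callable, Dict, List


def measure_latency_ms(
    predict: Callable[[str], str],
    texts: List[str],
    warmup: int = 2,
) -> Dict[str, float]:
    """Time single-example calls to `predict` and summarize in milliseconds."""
    # Warm-up calls are not timed (lazy initialization, caches).
    for text in texts[:warmup]:
        predict(text)

    samples = []
    for text in texts:
        start = time.perf_counter()
        predict(text)
        samples.append((time.perf_counter() - start) * 1000.0)

    return {
        "min": min(samples),
        "median": statistics.median(samples),
        "max": max(samples),
    }


def stub_predict(text: str) -> str:
    # Placeholder only: swap in the real PyTorch or ONNX inference call here.
    if text.rstrip().endswith(("?", ".", "!")):
        return "start_speaking"
    return "continue_listening"


if __name__ == "__main__":
    stats = measure_latency_ms(stub_predict, ["Is it open?", "I was wondering if"])
    print(stats)
```

Replacing stub_predict with the actual inference function gives numbers that can be compared against the latency ranges quoted above.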