Turn Detector Qwen3-4B: Real-time Turn-End Detection
This model is a fine-tuned Qwen3-4B language model, specifically optimized for real-time turn-end detection in multilingual call center conversations. Its primary function is to predict the probability that a speaker has finished their turn (P(<|im_end|>)), enabling low-latency voice agent pipelines (e.g., LiveKit) to determine the appropriate moment to respond.
Key Capabilities
- Real-time Turn Detection: Outputs a probability score for turn completion, with P(<|im_end|>) > 0.5 indicating a complete turn.
- Multilingual Support: Evaluated across 12 language pairs, including Chinese-Tamil and Malay-English.
- High Precision: Achieves 100% precision on its synthetic evaluation set, meaning it produced no false positives there (it never marked an unfinished turn as complete).
- Optimized for Voice Agents: Designed to integrate into voice agent systems where timely and accurate turn-taking is crucial.
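As a minimal sketch of the decision rule described above, the snippet below turns a vector of next-token logits into P(<|im_end|>) and applies the 0.5 threshold. It assumes the logits for the position after the user's latest text have already been obtained from the model; the function names and the `im_end_id` parameter are hypothetical and chosen for illustration, not taken from the model's own code.

```python
import math
from typing import Sequence


def end_of_turn_probability(next_token_logits: Sequence[float], im_end_id: int) -> float:
    """Softmax probability assigned to the <|im_end|> token.

    next_token_logits: one logit per vocabulary entry for the next position.
    im_end_id: vocabulary index of <|im_end|> (assumed to be looked up elsewhere).
    """
    # Subtract the max logit for numerical stability before exponentiating.
    peak = max(next_token_logits)
    exps = [math.exp(x - peak) for x in next_token_logits]
    return exps[im_end_id] / sum(exps)


def is_turn_complete(
    next_token_logits: Sequence[float], im_end_id: int, threshold: float = 0.5
) -> bool:
    """Apply the documented rule: the turn is complete when P(<|im_end|>) > threshold."""
    return end_of_turn_probability(next_token_logits, im_end_id) > threshold


if __name__ == "__main__":
    # Toy three-token vocabulary where index 1 stands in for <|im_end|>.
    toy_logits = [0.0, 2.0, 0.0]
    print(end_of_turn_probability(toy_logits, im_end_id=1))
    print(is_turn_complete(toy_logits, im_end_id=1))
```

In a voice agent, a higher threshold would trade more missed turn ends for fewer interruptions; 0.5 is the value stated for this model.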
Performance Highlights
On a synthetic test set of 238 samples (119 positive, 119 negative) across 12 language pairs, the model achieved:
- Accuracy: 88.24%
- Precision: 100.00%
- Recall: 76.47%
- F1 Score: 86.67%
Notably, it classified every negative case (speaker still talking) correctly, so all of its errors were missed turn ends rather than premature ones. Performance was strong across language pairs, including Chinese-Tamil and Malay-English.
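The reported figures can be cross-checked with a short calculation. The confusion-matrix counts below are not stated in the source; they are inferred from the reported rates (76.47% recall on 119 positives implies 91 true positives, and 100% precision implies zero false positives). The helper name `binary_metrics` is hypothetical.

```python
def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute standard binary classification metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }


# Inferred counts (assumption): 119 positives split into 91 detected and 28 missed;
# all 119 negatives classified correctly.
metrics = binary_metrics(tp=91, fp=0, tn=119, fn=28)

if __name__ == "__main__":
    for name, value in metrics.items():
        print(f"{name}: {value * 100:.2f}%")
```

With these counts, the four metrics round to the percentages listed above, which suggests the reported numbers are internally consistent.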
Training Details
The model was fine-tuned from the base Qwen/Qwen3-4B model on positive samples (complete conversations ending with <|im_end|>). Training used Liger Fused Linear Cross Entropy loss, FA4 attention, and bfloat16 precision, with a block size of 8192 and a constant learning rate of 2e-5 for 1 epoch. The training data included datasets such as Call Center Language Switching and Malaysian Multiturn Chat Assistant.