ZzWater/viitor-voice-mix

Task: Text Generation
Concurrency Cost: 1
Model Size: 0.5B
Quant: BF16
Ctx Length: 32k
Published: Dec 14, 2024
License: cc-by-nc-sa-4.0
Architecture: Transformer
Open Weights

ZzWater/viitor-voice-mix is a 0.5-billion-parameter, LLM-based text-to-speech (TTS) engine developed by ZzWater. The lightweight model is designed for low-latency, real-time streaming output and supports Chinese and English. It offers zero-shot voice cloning, more than 300 voices, and adjustable speech rate, and it targets deployments ranging from servers to mobile devices.


ViiTor-Voice: A Lightweight, Real-time LLM-based TTS Engine

ViiTor-Voice by ZzWater is a 0.5-billion-parameter text-to-speech (TTS) model built for efficiency and low latency. It supports Chinese and English and offers zero-shot voice cloning, which replicates a voice from minimal samples. The design keeps compute requirements low, so the model can be deployed in a range of environments, including mobile and edge devices.
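
To put the 0.5B parameter count in context, here is a rough back-of-the-envelope sketch of the memory needed for the weights alone. It assumes the nominal figure of 0.5 billion parameters and 2 bytes per parameter for BF16; the function name `weight_memory_gib` is made up for this illustration. The real checkpoint may differ, and the estimate excludes the KV cache, activations, and any audio codec or decoder that runs alongside the model.

```python
def weight_memory_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory for the weights alone, in GiB.

    bytes_per_param defaults to 2 because BF16 stores each value in 16 bits.
    """
    return num_params * bytes_per_param / 2**30


if __name__ == "__main__":
    # Assumption: exactly 0.5e9 parameters (the card only says "0.5B").
    print(f"BF16 weights: {weight_memory_gib(0.5e9):.2f} GiB")  # prints "BF16 weights: 0.93 GiB"
```

At roughly 1 GiB of weights, the model fits on hardware that larger LLM-based systems would not, which is consistent with the mobile and edge positioning above.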

Key Capabilities

  • Lightweight Design: At only 0.5B parameters, the model is efficient and compatible with most LLM inference engines, which suits it to a wide range of deployment scenarios.
  • Real-time Streaming Output: Reports a first-frame latency of 200 ms on an NVIDIA Tesla T4, described as industry-leading, so interactive applications get audio back almost immediately.
  • Rich Voice Library: Provides over 300 distinct voice options to match various content requirements and preferences.
  • Flexible Speech Rate Adjustment: Allows natural variations in speech rate for enhanced emotional depth or efficient information delivery.
  • Zero-shot Voice Cloning: Supports cloning based on minimal voice samples, enhancing personalization.
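
First-frame latency, as quoted in the list above, is the time between issuing a request and receiving the first chunk of audio. The following is a minimal sketch of how that could be measured against any streaming source. It does not use the model's actual API: `fake_tts_stream` is a hypothetical stand-in that sleeps and yields silent bytes, and in practice it would be replaced by whatever iterator the chosen inference engine exposes.

```python
import time
from typing import Iterable, Iterator


def measure_stream(chunks: Iterable[bytes]) -> dict:
    """Consume a stream of audio chunks and record timing statistics."""
    start = time.perf_counter()
    first_chunk_s = None
    count = 0
    total_bytes = 0
    for chunk in chunks:
        if first_chunk_s is None:
            # Time to first chunk: the "first-frame latency" for this stream.
            first_chunk_s = time.perf_counter() - start
        count += 1
        total_bytes += len(chunk)
    return {
        "first_chunk_s": first_chunk_s,
        "total_s": time.perf_counter() - start,
        "chunks": count,
        "bytes": total_bytes,
    }


def fake_tts_stream(n_chunks: int = 3, delay_s: float = 0.05) -> Iterator[bytes]:
    """Hypothetical stand-in for a TTS engine: yields silence after a delay."""
    for _ in range(n_chunks):
        time.sleep(delay_s)   # simulated generation time per chunk
        yield b"\x00" * 320   # placeholder audio payload


if __name__ == "__main__":
    stats = measure_stream(fake_tts_stream())
    print(f"first chunk after {stats['first_chunk_s'] * 1000:.0f} ms, "
          f"{stats['chunks']} chunks, {stats['bytes']} bytes")
```

The timer starts before iteration begins, so the measurement includes the work done to produce the first chunk, not just the gap between chunks.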

Good For

  • Applications requiring low-latency, real-time speech generation.
  • Deployments on resource-constrained hardware such as mobile phones and edge devices.
  • Projects needing diverse voice options and flexible speech rate control.
  • Use cases benefiting from zero-shot voice cloning for personalized audio output.