MYZY-AI/Muyan-TTS-SFT
Hugging Face
Text Generation | Concurrency Cost: 1 | Model Size: 3.2B | Quant: BF16 | Ctx Length: 32k | Published: Apr 22, 2025 | License: apache-2.0 | Architecture: Transformer | Open Weights

MYZY-AI's Muyan-TTS-SFT is a 3.2 billion parameter trainable Text-to-Speech (TTS) model designed for podcast applications. Pre-trained on over 100,000 hours of podcast audio, it offers high-quality zero-shot TTS synthesis and supports speaker adaptation from dozens of minutes of target speech. Because its voices can be customized, it is suited to personalized audio content creation.

Muyan-TTS-SFT: Trainable TTS for Podcasts

Muyan-TTS-SFT is a 3.2 billion parameter Text-to-Speech (TTS) model developed by MYZY-AI for podcast production on a constrained budget. It is pre-trained on over 100,000 hours of podcast audio to deliver high-quality voice generation.

Key Capabilities

  • Zero-Shot TTS Synthesis: Generates high-quality speech from text without speaker-specific training, using a reference audio clip to set the voice.
  • Speaker Adaptation: Can be fine-tuned to an individual voice with as little as dozens of minutes of target speech.
  • SFT Model for Specific Voices: The sft model type is trained on a specific voice (e.g., Claire's voice in the examples) for consistent output, while the base model allows for arbitrary speaker prompts.
  • API Support: Includes an API for easy integration and deployment, with vLLM acceleration enabled by default for efficient inference.
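
The difference between the base and sft model types can be sketched as a small request builder. This is a minimal illustration, not the project's actual API: the TTSRequest class, the build_payload function, and every field name (model_type, ref_wav_path, prompt_text) are hypothetical and chosen only to show that a zero-shot request to the base model needs a reference audio clip, while the sft model already carries a fixed voice.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TTSRequest:
    """Hypothetical request shape; field names are illustrative only."""
    text: str
    model_type: str = "sft"              # "sft" (fixed voice) or "base" (zero-shot)
    ref_wav_path: Optional[str] = None   # reference audio clip for the target voice
    prompt_text: Optional[str] = None    # transcript of the reference clip, if any


def build_payload(req: TTSRequest) -> dict:
    """Validate a request and return a JSON-serializable payload."""
    if req.model_type not in ("base", "sft"):
        raise ValueError(f"unknown model_type: {req.model_type!r}")
    if not req.text.strip():
        raise ValueError("text must not be empty")
    # Zero-shot synthesis takes its voice from the reference clip,
    # so the base model cannot be called without one (assumed rule).
    if req.model_type == "base" and not req.ref_wav_path:
        raise ValueError("base model requires ref_wav_path")
    payload = {"model_type": req.model_type, "text": req.text}
    # Optional fields are passed through only when provided.
    if req.ref_wav_path:
        payload["ref_wav_path"] = req.ref_wav_path
    if req.prompt_text:
        payload["prompt_text"] = req.prompt_text
    return payload
```

A payload built this way could then be sent to whatever endpoint the deployed API exposes; the endpoint and its real field names should be taken from the project's own quickstart guide.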

Good For

  • Podcast Production: Ideal for creating and customizing voices for podcast content.
  • Personalized Audio Content: Generating speech in a specific speaker's voice with minimal adaptation data.
  • Developers: Provides a trainable framework for building custom TTS solutions, with clear installation and quickstart guides.