NVIDIA NeMo · Framework · 44/100
via “text-to-speech synthesis with grapheme-to-phoneme conversion and prosody control”
NVIDIA's framework for scalable generative AI training.
Unique: Decouples duration/pitch prediction (FastPitch) from waveform generation (HiFi-GAN vocoder), so linguistic and acoustic modeling can be optimized independently. Grapheme-to-phoneme (G2P) modules are pluggable and language-aware, and phoneme-level control is available via inline markup (e.g., `[p ə 'l ɪ s]` for 'police'). Vocoder fine-tuning uses speaker adaptation layers rather than full retraining, reducing data requirements from 1000+ utterances to 10-30.
vs others: More granular prosody control and speaker adaptation than Tacotron2-based systems, but less natural-sounding output than Glow-TTS or recent diffusion-based TTS models; stronger multilingual support than Glow-TTS, at the cost of requiring a language-specific G2P model for each language.
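
The pluggable G2P and bracketed phoneme markup described under "Unique" can be illustrated with a minimal sketch. This is not NeMo's API: `G2P_REGISTRY`, `toy_english_g2p`, and `text_to_phonemes` are hypothetical names, and the tiny lexicon stands in for a real pronunciation dictionary or trained model. The sketch only shows the idea that plain text is routed through a per-language G2P backend, while bracketed spans are passed through as phonemes unchanged.

```python
import re
from typing import Callable, Dict, List

# Hypothetical type: a G2P backend maps one word to a list of phoneme symbols.
G2P = Callable[[str], List[str]]


def toy_english_g2p(word: str) -> List[str]:
    """Toy stand-in for a real G2P model (illustration only)."""
    lexicon = {"the": ["ð", "ə"], "came": ["k", "eɪ", "m"]}
    # Unknown words fall back to being spelled out letter by letter.
    return lexicon.get(word.lower(), list(word.lower()))


# Backends are keyed by language code, so a language can be swapped or added
# without touching the parsing logic below.
G2P_REGISTRY: Dict[str, G2P] = {"en": toy_english_g2p}

# Matches one bracketed phoneme override such as "[p ə 'l ɪ s]".
BRACKETED = re.compile(r"\[([^\]]*)\]")


def _plain_to_phonemes(chunk: str, g2p: G2P) -> List[str]:
    """Run every word in a plain-text chunk through the G2P backend."""
    phonemes: List[str] = []
    for word in re.findall(r"\w+", chunk):
        phonemes.extend(g2p(word))
    return phonemes


def text_to_phonemes(text: str, lang: str) -> List[str]:
    """Convert text to phonemes, honoring bracketed phoneme overrides."""
    g2p = G2P_REGISTRY[lang]
    phonemes: List[str] = []
    pos = 0
    for match in BRACKETED.finditer(text):
        # Plain text before the override goes through G2P.
        phonemes.extend(_plain_to_phonemes(text[pos:match.start()], g2p))
        # The override is whitespace-separated phonemes, used verbatim.
        phonemes.extend(match.group(1).split())
        pos = match.end()
    # Remaining plain text after the last override.
    phonemes.extend(_plain_to_phonemes(text[pos:], g2p))
    return phonemes


if __name__ == "__main__":
    print(text_to_phonemes("The [p ə 'l ɪ s] came", "en"))
```

In a two-stage system like the one described, a phoneme sequence of this kind would be the input to the acoustic model, whose spectrogram output is then handed to the vocoder.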