Mel Spectrogram Generation With Duration And Pitch Prediction

1

higgs-audio-v2-generation-3B-base (Model, 48/100)

via “mel-spectrogram generation with duration and pitch prediction”

Text-to-speech model. 295,715 downloads.

Unique: Uses auxiliary duration and pitch prediction heads trained jointly with the main decoder. This lets the model learn prosody implicitly, without explicit phoneme-to-frame alignment annotations, and allows prosody to be scaled at inference time by modulating the predicted values.

vs others: More flexible than fixed-duration TTS and avoids the alignment brittleness of older Tacotron models by learning duration distributions end-to-end; more controllable than models such as Glow-TTS and FastSpeech, which predict durations but do not expose pitch predictions.
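The inference-time prosody scaling mentioned in this entry can be pictured with a minimal sketch. It assumes the prediction heads emit one frame count and one F0 value per token; the function name `scale_prosody` and its parameters are hypothetical and are not taken from the model's code.

```python
def scale_prosody(durations, f0_hz, duration_scale=1.0, pitch_scale=1.0):
    """Scale predicted durations and pitch before decoding (hypothetical helper).

    durations: per-token predicted frame counts (floats), assumed output
               of a duration head.
    f0_hz:     per-token predicted F0 in Hz; 0.0 is assumed to mark
               unvoiced tokens.
    """
    if duration_scale <= 0 or pitch_scale <= 0:
        raise ValueError("scales must be positive")
    if len(durations) != len(f0_hz):
        raise ValueError("durations and f0_hz must have the same length")
    # Stretch or compress each token, keeping at least one frame per token.
    frames = [max(1, round(d * duration_scale)) for d in durations]
    # Shift pitch on voiced tokens only; unvoiced tokens stay at 0.0.
    f0 = [f * pitch_scale if f > 0 else 0.0 for f in f0_hz]
    return frames, f0
```

With `duration_scale=1.5` each token lasts roughly 50% longer (slower speech), and `pitch_scale=1.2` raises voiced pitch by 20%.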

2

speecht5_tts (Model, 43/100)

via “non-autoregressive mel-spectrogram generation with duration prediction”

Text-to-speech model. 149,878 downloads.

Unique: Combines non-autoregressive parallel generation with an explicit duration prediction module, enabling both low-latency synthesis and controllable speech rate without retraining. Autoregressive models, by contrast, generate frame by frame and cannot easily adjust timing.

vs others: Faster inference than Tacotron 2 or Transformer TTS while maintaining quality through duration modeling, and more controllable than FastSpeech 2 because it includes speaker conditioning for multi-speaker synthesis.
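Duration-based rate control of the kind described in this entry is commonly implemented with a length regulator, which repeats each token's encoder state for its predicted number of frames. The sketch below illustrates that general technique only; `length_regulate` and its `rate` parameter are hypothetical names, not part of the speecht5_tts API.

```python
def length_regulate(token_states, durations, rate=1.0):
    """Expand per-token states to per-frame states (illustrative sketch).

    token_states: one encoder state per input token (any Python object).
    durations:    predicted frames per token.
    rate:         assumed speed factor; values above 1.0 give fewer frames
                  (faster speech), values below 1.0 give more (slower).
    """
    if rate <= 0:
        raise ValueError("rate must be positive")
    if len(token_states) != len(durations):
        raise ValueError("need exactly one duration per token")
    frames = []
    for state, d in zip(token_states, durations):
        n = max(0, round(d / rate))  # frames allotted to this token
        frames.extend([state] * n)   # repeat the state n times
    return frames
```

Because every frame position is known once the durations are fixed, a decoder can process all frames in parallel, and changing `rate` alters timing without touching the model weights.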

3

MeloTTS-Japanese (Model, 41/100)

via “phoneme-level duration and pitch prediction with linguistic features”

Text-to-speech model. 210,673 downloads.

Unique: Implements duration and pitch prediction as separate feed-forward networks operating on linguistic embeddings from the text encoder, enabling joint optimization with the mel-spectrogram decoder via multi-task learning. The pitch predictor generates frame-level F0 values that are directly supervised during training, allowing the model to learn Japanese pitch accent patterns from data rather than relying on rule-based accent assignment.

vs others: More flexible than rule-based prosody systems (e.g., Festival, MARY TTS) by learning prosody patterns from data; faster than sequence-to-sequence pitch prediction models (feed-forward vs. RNN/Transformer) while maintaining comparable accuracy; enables fine-grained prosody control that commercial APIs typically don't expose.
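The multi-task objective described in this entry can be sketched as a weighted sum of a spectrogram loss and the two auxiliary losses. This is an assumption-laden illustration rather than the MeloTTS training code: the choice of L1 and MSE terms, the `log(d + 1)` duration target, the weights `w_dur` and `w_pitch`, and the function name are all hypothetical, and F0 values are assumed to be normalized already.

```python
import math


def _mean(values):
    return sum(values) / len(values)


def multitask_loss(mel_pred, mel_true, log_dur_pred, dur_true,
                   f0_pred, f0_true, w_dur=1.0, w_pitch=1.0):
    """Combine mel, duration, and pitch losses (hypothetical sketch).

    mel_pred / mel_true:   flattened mel-spectrogram values.
    log_dur_pred:          predicted log-durations, one per token.
    dur_true:              ground-truth frame counts, one per token.
    f0_pred / f0_true:     frame-level F0, assumed already normalized.
    """
    # L1 distance between predicted and reference spectrogram values.
    mel_loss = _mean([abs(p - t) for p, t in zip(mel_pred, mel_true)])
    # MSE in the log domain; the +1 keeps zero-length tokens finite.
    dur_loss = _mean([(p - math.log(t + 1)) ** 2
                      for p, t in zip(log_dur_pred, dur_true)])
    # MSE on directly supervised frame-level pitch.
    pitch_loss = _mean([(p - t) ** 2 for p, t in zip(f0_pred, f0_true)])
    total = mel_loss + w_dur * dur_loss + w_pitch * pitch_loss
    return total, {"mel": mel_loss, "duration": dur_loss, "pitch": pitch_loss}
```

Minimizing the single `total` value updates the shared text encoder from all three signals at once, which is the sense in which the predictors are jointly optimized with the decoder.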
