Non-Autoregressive Mel-Spectrogram Generation with Duration Prediction

1

ChatTTS | Agent | 53/100

via “mel spectrogram generation from discrete audio tokens”

A generative speech model for daily dialogue.

Unique: Uses a DVAE (Discrete Variational Autoencoder) rather than a simple lookup table or continuous decoder, enabling learned, high-quality reconstruction of spectrograms from discrete tokens. The DVAE is trained end-to-end with the audio codec, ensuring that discrete tokens capture all information needed for high-fidelity spectrogram reconstruction.

vs others: More flexible than fixed codebooks because the DVAE decoder learns to reconstruct spectrograms from tokens, enabling better quality and smoother transitions between tokens. More efficient than storing spectrograms directly because discrete tokens are more compact and enable better generalization across speakers and content.
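
The discrete bottleneck this entry describes can be illustrated with a toy nearest-neighbor vector quantizer. This is only a sketch under stated assumptions: the codebook values, its size, and the function names `encode` and `decode` are hypothetical, and this is not ChatTTS's actual implementation. A real DVAE decoder would pass the looked-up vectors through learned layers rather than returning them directly.

```python
# Hypothetical toy codebook: 4 code vectors of dimension 2.
# Real models use far larger, learned codebooks.
CODEBOOK = [
    [0.0, 0.0],
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
]


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def encode(frames, codebook=CODEBOOK):
    """Map each continuous frame to the index of its nearest code vector."""
    return [
        min(range(len(codebook)), key=lambda i: squared_distance(frame, codebook[i]))
        for frame in frames
    ]


def decode(tokens, codebook=CODEBOOK):
    """Look up the code vector for each token.

    Assumption: a learned decoder would refine these vectors further;
    here the lookup itself stands in for the reconstruction.
    """
    return [list(codebook[t]) for t in tokens]


frames = [[0.1, -0.2], [0.9, 0.2], [0.8, 0.9]]
tokens = encode(frames)
reconstruction = decode(tokens)
print(tokens)
print(reconstruction)
```

The gap between `frames` and `reconstruction` is the quantization error that a learned decoder is meant to reduce.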

2

chatterbox | Model | 50/100

via “real-time mel-spectrogram generation with attention-based alignment”

Text-to-speech model. 2,108,297 downloads.

Unique: Uses learned attention alignment rather than explicit duration prediction models, reducing model complexity and enabling end-to-end training without duration annotations. Attention weights are computed dynamically at inference time, allowing the model to adapt alignment to input length without retraining.

vs others: Simpler than duration-based models (e.g., FastSpeech) because it avoids explicit duration prediction, but potentially less controllable because speech rate and pause length cannot be adjusted per-token at inference time.
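
As a rough illustration of attention weights computed dynamically at inference time, the sketch below runs generic scaled dot-product attention for a single decoder step. The vectors and the function name `attend` are hypothetical; this is an assumption about the general technique, not chatterbox's actual code.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, encoder_states):
    """Scaled dot-product attention for one decoder step.

    Returns the alignment weights over the input positions and the
    weighted sum of encoder states (the context vector).
    """
    scale = math.sqrt(len(query))
    scores = [
        sum(q * k for q, k in zip(query, state)) / scale
        for state in encoder_states
    ]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [
        sum(w * state[d] for w, state in zip(weights, encoder_states))
        for d in range(dim)
    ]
    return weights, context


# Hypothetical encoder outputs for three input tokens.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [0.0, 4.0]
weights, context = attend(query, encoder_states)
print(weights)
print(context)
```

Because the weights are recomputed for every query, the same function works for any number of encoder states, which is why no duration annotations are needed. The trade-off noted above also shows: nothing here exposes a per-token duration that could be edited.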

3

higgs-audio-v2-generation-3B-base | Model | 48/100

via “mel-spectrogram generation with duration and pitch prediction”

Text-to-speech model. 295,715 downloads.

Unique: Uses auxiliary prediction heads for duration and pitch, trained jointly with the main decoder. This enables implicit prosody learning without explicit phoneme-frame alignment annotations and allows inference-time prosody scaling by modulating the predicted values.

vs others: More flexible than fixed-duration TTS and avoids the alignment brittleness of older attention-based Tacotron models by learning duration distributions end-to-end; more controllable than models that predict duration but not pitch (e.g., Glow-TTS, the original FastSpeech), since both predictions can be adjusted at inference time.

4

Kokoro-82M-bf16 | Model | 44/100

via “mel-spectrogram to waveform vocoding”

Text-to-speech model. 469,583 downloads.

Unique: Uses a non-autoregressive vocoder (likely a HiFi-GAN variant) that generates entire waveforms in a single forward pass, achieving a 50-100x speedup over autoregressive alternatives like WaveNet. The vocoder is optimized for MLX inference, leveraging GPU acceleration to produce 22050 Hz audio at real-time or faster-than-real-time speeds.

vs others: Faster than WaveGlow or WaveNet vocoders while maintaining comparable audio quality; more efficient than traditional signal processing vocoders (WORLD, STRAIGHT) because neural vocoding requires no explicit pitch extraction or spectral envelope modeling.

5

speecht5_tts | Model | 43/100

via “non-autoregressive mel-spectrogram generation with duration prediction”

Text-to-speech model. 149,878 downloads.

Unique: Combines non-autoregressive parallel generation with an explicit duration prediction module, enabling both low-latency synthesis and controllable speech rate without retraining. Autoregressive models, by contrast, generate frame by frame and cannot easily adjust timing.

vs others: Faster inference than Tacotron 2 or Transformer TTS while maintaining quality through duration modeling, and more controllable than FastSpeech 2 because it adds speaker conditioning for multi-speaker synthesis.
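
Duration-based rate control of the kind this entry describes is commonly done with a length regulator in the FastSpeech style. The sketch below is a minimal version under stated assumptions: the phoneme labels, the durations, and the `rate` parameter are hypothetical, and it is not taken from the speecht5_tts code.

```python
def length_regulate(phoneme_states, durations, rate=1.0):
    """Repeat each phoneme state for its predicted number of frames.

    rate > 1.0 shortens every phoneme (faster speech); rate < 1.0
    lengthens it. Each phoneme keeps at least one frame so that none
    is dropped entirely.
    """
    if rate <= 0:
        raise ValueError("rate must be positive")
    frames = []
    for state, duration in zip(phoneme_states, durations):
        n_frames = max(1, round(duration / rate))
        frames.extend([state] * n_frames)
    return frames


# Hypothetical phoneme sequence with predicted durations in frames.
phonemes = ["HH", "AH", "L", "OW"]
durations = [2, 4, 4, 6]

normal = length_regulate(phonemes, durations)
fast = length_regulate(phonemes, durations, rate=2.0)
print(len(normal), len(fast))
```

Since the expanded frame sequence is known before decoding starts, all output frames can be generated in parallel, and changing `rate` requires no retraining.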
