SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing (SpeechT5), via “fine-tuning on downstream speech tasks with minimal labeled data”
* ⭐ 06/2022: [WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing (WavLM)](https://ieeexplore.ieee.org/abstract/document/9814838)
Unique: Enables efficient fine-tuning of a single pre-trained model across diverse speech tasks (ASR, TTS, speech translation, voice conversion, speech enhancement, speaker ID), leveraging joint speech/text (cross-modal) pre-training to reduce task-specific labeled-data requirements. The shared encoder-decoder backbone enables parameter sharing across tasks, with modality-specific pre-nets and post-nets handling each task's input and output.
vs others: A single pre-trained model can be fine-tuned for multiple speech tasks instead of training separate task-specific models, reducing overall labeled-data requirements and system complexity, though per-task performance may trail that of specialized models (see the sketch below).
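
A minimal sketch of the "one pre-trained backbone, many fine-tuned tasks" idea, assuming the Hugging Face `transformers` SpeechT5 classes and the `microsoft/speecht5_asr`, `microsoft/speecht5_tts`, and `microsoft/speecht5_hifigan` checkpoints; the demo audio and x-vector datasets are illustrative assumptions, not from the paper:

```python
# Sketch: the same SpeechT5 encoder-decoder backbone fine-tuned for two tasks
# (ASR and TTS), loaded via Hugging Face `transformers`. Checkpoint names and
# datasets below are assumptions for illustration.
import torch
from datasets import load_dataset
from transformers import (
    SpeechT5ForSpeechToText,  # ASR head on the shared encoder-decoder
    SpeechT5ForTextToSpeech,  # TTS head on the same backbone
    SpeechT5HifiGan,          # vocoder: mel spectrogram -> waveform
    SpeechT5Processor,
)

# --- ASR: transcribe a 16 kHz waveform ---------------------------------------
asr_processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_asr")
asr_model = SpeechT5ForSpeechToText.from_pretrained("microsoft/speecht5_asr")

sample = load_dataset(
    "hf-internal-testing/librispeech_asr_demo", "clean", split="validation"
)[0]["audio"]
asr_inputs = asr_processor(
    audio=sample["array"], sampling_rate=16000, return_tensors="pt"
)
predicted_ids = asr_model.generate(**asr_inputs, max_length=100)
print(asr_processor.batch_decode(predicted_ids, skip_special_tokens=True)[0])

# --- TTS: synthesize a sentence with a different head on the same backbone ----
tts_processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
tts_model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

# Speaker identity is conditioned on a 512-dim x-vector embedding.
xvectors = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
speaker_embedding = torch.tensor(xvectors[7306]["xvector"]).unsqueeze(0)

tts_inputs = tts_processor(
    text="One shared encoder-decoder, fine-tuned per task.", return_tensors="pt"
)
waveform = tts_model.generate_speech(
    tts_inputs["input_ids"], speaker_embedding, vocoder=vocoder
)
print(waveform.shape)  # 1-D tensor of audio samples at 16 kHz
```

Both checkpoints are fine-tuned from the same SpeechT5 pre-trained backbone; what differs per task is the modality-specific pre/post-net and the fine-tuning data, which is where the labeled-data savings come from.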