Self-Supervised Acoustic Representation Learning Without Labeled Data

1

wav2vec2-base-960h | Model | 51/100

via “acoustic-feature-extraction-with-learned-representations”

automatic-speech-recognition model. 1,210,723 downloads.

Unique: Learns acoustic representations through contrastive learning on unlabeled audio rather than from supervised phonetic labels. The model discovers phonetically relevant features by predicting quantized codewords from nearby context, producing embeddings that generalize better to out-of-domain audio than supervised baselines.

vs others: Produces more linguistically informed embeddings than MFCC or mel-spectrogram features because the transformer encoder captures long-range dependencies, which improves performance on downstream tasks such as speaker verification (EER 2.1% vs 3.5% for MFCC-based systems).
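The contrastive objective described in this entry can be illustrated with a small sketch. It is not the model's training code: the toy vectors, the function names, and the temperature of 0.1 are assumptions chosen for illustration. The loss is low when the context vector at a masked position is closer to the true quantized codeword than to distractor codewords taken from other positions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def contrastive_loss(context, positive, distractors, temperature=0.1):
    """InfoNCE-style loss for one masked time step.

    context:     encoder output at the masked position
    positive:    quantized codeword of the true latent at that position
    distractors: quantized codewords sampled from other positions
    """
    candidates = [positive] + list(distractors)
    logits = [cosine(context, c) / temperature for c in candidates]
    # Numerically stable log-softmax; the positive sits at index 0.
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return -(logits[0] - log_norm)

# Toy vectors, made up for illustration only.
context = [0.9, 0.1, 0.0]
true_codeword = [1.0, 0.0, 0.0]
distractors = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

good = contrastive_loss(context, true_codeword, distractors)
# Pretend a distractor is the target: the loss should be much higher.
bad = contrastive_loss(context, distractors[0], [true_codeword, distractors[1]])
print(good < bad)
```

Minimizing this loss over many masked positions is what pushes the encoder toward features that distinguish one speech sound from another without any phonetic labels.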

2

w2v-bert-2.0 | Model | 50/100

via “self-supervised acoustic representation learning without labeled data”

feature-extraction model. 3,341,362 downloads.

Unique: Combines wav2vec2's contrastive learning (predicting masked frames from context) with BERT's masked language modeling on speech, creating a dual-objective pretraining approach that learns both acoustic and contextual patterns without labels, unlike supervised models that require phoneme or speaker annotations.

vs others: Eliminates annotation requirements compared to supervised acoustic models, while providing better generalization than single-objective self-supervised approaches (wav2vec2 alone) due to dual pretraining objectives
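A rough sketch of how two pretraining objectives might be combined into one training signal follows. The function names, the equal default weighting, and the toy inputs are assumptions for illustration; the listing does not state how the two losses are actually balanced.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of scores."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return [l - log_norm for l in logits]

def masked_prediction_loss(frame_logits, target_ids, masked):
    """Mean cross-entropy over masked frames only (BERT-style objective).

    frame_logits: per-frame scores over a discrete token vocabulary
    target_ids:   per-frame index of the correct discrete token
    masked:       per-frame flag, True where the input was masked
    """
    losses = [
        -log_softmax(logits)[target]
        for logits, target, is_masked in zip(frame_logits, target_ids, masked)
        if is_masked
    ]
    return sum(losses) / len(losses) if losses else 0.0

def dual_objective(contrastive, masked_prediction, weight=1.0):
    """Weighted sum of the two losses; `weight` is a hypothetical knob."""
    return contrastive + weight * masked_prediction

# Three frames over a 4-token vocabulary; only the last two are masked.
frame_logits = [[2.0, 0.0, 0.0, 0.0], [0.0, 3.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
target_ids = [0, 1, 2]
masked = [False, True, True]

mlm = masked_prediction_loss(frame_logits, target_ids, masked)
total = dual_objective(contrastive=0.5, masked_prediction=mlm)
print(total)
```

Unmasked frames contribute nothing to the masked-prediction term, so the model is only scored where it had to infer content from context.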

3

wav2vec2-large-xlsr-53-japanese | Model | 49/100

via “audio-feature-extraction-with-learned-representations”

automatic-speech-recognition model. 1,007,776 downloads.

Unique: Provides contextualized, time-aligned embeddings via transformer self-attention rather than static frame-level features, capturing long-range acoustic dependencies. The quantization bottleneck (used during pretraining) forces the model to learn discrete acoustic units, resulting in more interpretable and robust representations than continuous feature extraction.

vs others: Produces richer, context-aware embeddings than traditional MFCC or spectrogram-based features, and is more efficient than extracting features from larger models like Whisper while maintaining competitive quality for Japanese audio.
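The idea of a quantization bottleneck that maps continuous frames to discrete acoustic units can be sketched as a nearest-codeword lookup. This is a deliberate simplification: wav2vec 2.0 style models use a learned Gumbel-softmax product quantizer during pretraining, and the codebook and frames below are made-up toy values.

```python
def quantize(frame, codebook):
    """Return (index, codeword) of the codebook entry closest to `frame`."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    index = min(range(len(codebook)), key=lambda i: sq_dist(frame, codebook[i]))
    return index, codebook[index]

# Hypothetical 3-entry codebook over 2-dimensional latents.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

# Four continuous frame-level features, one per time step.
frames = [[0.1, -0.1], [0.9, 0.2], [0.2, 0.8], [0.8, 0.1]]

# Each frame collapses to the id of its nearest discrete unit.
unit_ids = [quantize(f, codebook)[0] for f in frames]
print(unit_ids)  # [0, 1, 2, 1]
```

Because many slightly different frames collapse onto the same unit id, small acoustic variations are discarded, which is one intuition for why the bottleneck can make representations more robust.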

4

AudioLM: a Language Modeling Approach to Audio Generation (AudioLM) | Product | 24/100

via “transcript-free audio generation without annotation requirements”

* ⭐ 09/2022: [AudioGen: Textually Guided Audio Generation (AudioGen)](https://arxiv.org/abs/2209.15352)

Unique: Eliminates transcript and annotation requirements by learning directly from raw audio, using self-supervised pre-training (masked language modeling) to discover linguistic and acoustic structure without explicit supervision. This is a fundamental architectural choice that differs from text-to-speech and phoneme-based approaches.

vs others: Scales to unlabeled audio corpora that would be prohibitively expensive to transcribe, and avoids transcription errors that degrade text-to-speech quality, but sacrifices explicit content control that text-based systems provide.

5

mSLAM: Massively multilingual joint pre-training for speech and text (mSLAM) | Product | 24/100

via “multilingual speech representation learning with contrastive objectives”

* ⭐ 02/2022: [ADD 2022: the First Audio Deep Synthesis Detection Challenge (ADD)](https://arxiv.org/abs/2202.08433)

Unique: Applies contrastive learning across 143+ languages simultaneously in a single model, learning universal speech representations without language-specific supervision, whereas prior work (wav2vec 2.0, HuBERT) typically trained on single languages or required language labels

vs others: Produces more language-agnostic representations than language-specific models, enabling better zero-shot transfer to new languages, and avoids the need for language identification by learning features that are inherently language-independent

6

BigSSL: Exploring the Frontier of Large-Scale Semi-Supervised Learning for ASR (BigSSL) | Product | 23/100

via “self-training with pseudo-labeling for unlabeled audio”

* ⭐ 08/2022: [MuLan: A Joint Embedding of Music Audio and Natural Language (MuLan)](https://arxiv.org/abs/2208.12415)

Unique: Integrates pseudo-labeling as the middle stage of a three-stage pipeline, between SSL pre-training and supervised fine-tuning. The specific pseudo-label generation and filtering mechanisms are not disclosed, but the pipeline represents a systematic approach to leveraging unlabeled data in semi-supervised ASR.

vs others: More systematic than ad-hoc pseudo-labeling because the pseudo-labels are grounded in pre-trained representations. Its effectiveness relative to alternatives depends on the undisclosed pseudo-label quality-control mechanisms.
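Since the filtering mechanism is not disclosed, the following is only a generic sketch of the pseudo-labeling stage, using one common choice: keep a teacher's transcript when its confidence clears a threshold. The `toy_teacher` function, the clip names, and the 0.9 threshold are all hypothetical.

```python
def pseudo_label(unlabeled, teacher, min_confidence=0.9):
    """Keep (audio, transcript) pairs the teacher is confident about.

    teacher: callable returning (transcript, confidence) for one clip.
    min_confidence: hypothetical filter threshold, not taken from the source.
    """
    kept = []
    for audio in unlabeled:
        transcript, confidence = teacher(audio)
        if confidence >= min_confidence:
            kept.append((audio, transcript))
    return kept

def toy_teacher(audio):
    """Stand-in for a pre-trained, fine-tuned ASR model (made-up outputs)."""
    lookup = {
        "clip_a": ("hello world", 0.97),
        "clip_b": ("helo wrld", 0.42),
        "clip_c": ("good morning", 0.91),
    }
    return lookup[audio]

# Stage 2 of the pipeline: extend the labeled set with confident pseudo-labels.
labeled = [("clip_0", "thank you")]
training_set = labeled + pseudo_label(["clip_a", "clip_b", "clip_c"], toy_teacher)
print(training_set)
```

The low-confidence transcript for `clip_b` is dropped, so the final supervised fine-tuning stage would see one human-labeled clip and two pseudo-labeled ones.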

7

Synthetaic | Product

via “self-supervised-model-training”
