Natural Language Inference Classification With Disentangled Attention

1

deberta-v3-base (Model, 49/100)

via “attention-visualization-and-interpretability”

fill-mask model. 2,463,712 downloads.

Unique: Disentangled attention represents each token with separate content and relative-position vectors and scores each head as the sum of three terms (content-to-content, content-to-position, position-to-content). The standard attention outputs expose the combined weights per head; recovering the individual terms for analysis requires hooking into the attention module.

vs others: Offers a richer basis for interpretability than standard BERT attention, because content and position interactions are computed separately; this can help researchers judge whether a failure stems from semantic confusion or from positional misunderstanding.
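For reference, the DeBERTa paper writes the unnormalized attention score between tokens $i$ and $j$ as a sum of three terms. The notation below follows the paper: $Q^c, K^c, V^c$ are content projections, $Q^r, K^r$ are relative-position projections, $\delta(i,j)$ is the clipped relative distance, and $d$ is the head dimension.

```latex
\tilde{A}_{i,j} =
  \underbrace{Q^c_i \, {K^c_j}^{\top}}_{\text{content-to-content}}
  + \underbrace{Q^c_i \, {K^r_{\delta(i,j)}}^{\top}}_{\text{content-to-position}}
  + \underbrace{K^c_j \, {Q^r_{\delta(j,i)}}^{\top}}_{\text{position-to-content}},
\qquad
H_o = \operatorname{softmax}\!\left(\frac{\tilde{A}}{\sqrt{3d}}\right) V^c
```

The $\sqrt{3d}$ scaling reflects the three summed terms; a position-to-position term is omitted in the paper because it adds little when relative positions are used.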

2

mDeBERTa-v3-base-xnli-multilingual-nli-2mil7 (Model, 48/100)

via “cross-lingual-natural-language-inference”

zero-shot-classification model. 303,704 downloads.

Unique: Fine-tuned on XNLI and the multilingual-NLI-26lang-2mil7 dataset, which together contain more than 2.7M premise-hypothesis pairs in 27 languages. The mDeBERTa-v3 backbone uses disentangled attention, which keeps content and relative-position information in separate vectors. Shared multilingual pretraining is what lets entailment patterns transfer across typologically distant languages (e.g., English to Japanese).

vs others: The underlying mDeBERTa-v3-base checkpoint is reported to outperform XLM-RoBERTa-base on zero-shot XNLI; accuracy for this fine-tuned model varies by language, so the per-language figures on the model card are a better guide than a single number. Unlike English-only NLI models (e.g., RoBERTa-large-mnli), it supports cross-lingual transfer without language-specific fine-tuning.
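As a rough illustration of how an NLI model is repurposed for zero-shot classification, the sketch below turns each candidate label into a hypothesis and ranks labels by entailment score. The hypothesis template and `toy_entailment_score` are hypothetical stand-ins (a word-overlap heuristic, not a real model); in practice the score would come from the NLI checkpoint's entailment probability.

```python
from typing import Callable, Dict, List


def zero_shot_classify(
    text: str,
    labels: List[str],
    entailment_score: Callable[[str, str], float],
    template: str = "This example is about {}.",  # assumed template
) -> Dict[str, float]:
    """Score each label by how strongly `text` entails its hypothesis."""
    scores = {
        label: entailment_score(text, template.format(label))
        for label in labels
    }
    total = sum(scores.values())
    # Normalize so the scores sum to 1 (single-label setting).
    return {label: score / total for label, score in scores.items()}


def toy_entailment_score(premise: str, hypothesis: str) -> float:
    """Hypothetical stand-in for an NLI model: word overlap plus smoothing."""
    p = set(premise.lower().replace(".", "").split())
    h = set(hypothesis.lower().replace(".", "").split())
    return len(p & h) + 0.1


ranked = zero_shot_classify(
    "The central bank raised interest rates to fight inflation.",
    ["inflation", "sports", "weather"],
    toy_entailment_score,
)
best = max(ranked, key=ranked.get)
print(best, ranked)
```

Because the labels are only ever seen as text in the hypothesis, new labels can be added at inference time without retraining.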

3

distilbart-cnn-12-6 (Model, 48/100)

via “interpretability and attention visualization”

summarization model. 1,111,635 downloads.

Unique: Exposes both encoder self-attention and decoder cross-attention weights, enabling analysis of how the input is read and how each generated token aligns with the source; also supports layer-wise hidden-state extraction for probing studies without modifying the model.

vs others: Gives direct access to model internals from a single forward pass, which is cheaper than perturbation methods such as LIME/SHAP (which treat the model as a black box and need many forward passes) and gradient-based attribution (which requires backpropagation).
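To make the cross-attention analysis concrete, here is a minimal sketch that averages a `[heads][target][source]` attention tensor over heads and reports the most-attended source token for each generated token. The tensor values and tokens are invented for illustration; with Hugging Face Transformers, real weights of this shape can be requested by passing `output_attentions=True` to the model.

```python
from typing import List


def mean_over_heads(attn: List[List[List[float]]]) -> List[List[float]]:
    """Average a [heads][target][source] attention tensor over heads."""
    n_heads = len(attn)
    n_tgt, n_src = len(attn[0]), len(attn[0][0])
    return [
        [sum(attn[h][t][s] for h in range(n_heads)) / n_heads for s in range(n_src)]
        for t in range(n_tgt)
    ]


def top_source_per_target(avg: List[List[float]], source_tokens: List[str]) -> List[str]:
    """For each target position, return the source token with the highest weight."""
    return [source_tokens[max(range(len(row)), key=row.__getitem__)] for row in avg]


# Toy data: 2 heads, 2 generated tokens, 4 source tokens (values are made up).
source_tokens = ["storms", "hit", "the", "coast"]
cross_attn = [
    [[0.7, 0.1, 0.1, 0.1], [0.1, 0.1, 0.1, 0.7]],
    [[0.5, 0.3, 0.1, 0.1], [0.2, 0.2, 0.1, 0.5]],
]

avg = mean_over_heads(cross_attn)
aligned = top_source_per_target(avg, source_tokens)
print(aligned)
```

Averaging over heads is only one possible aggregation; individual heads often behave differently, so per-head inspection may be more informative.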

4

mdeberta-v3-base (Model, 47/100)

via “multilingual masked token prediction with disentangled attention”

fill-mask model. 1,452,378 downloads.

Unique: Uses a disentangled attention mechanism (separate content and relative-position representations) instead of standard multi-head attention, and is pretrained with ELECTRA-style replaced token detection on multilingual CC100 data. The extra position terms add some attention computation rather than removing it; the benefit is accuracy across the languages covered.

vs others: The model card reports higher zero-shot XNLI accuracy than XLM-RoBERTa-base. The backbone has about 86M parameters, and the 250K-token vocabulary adds roughly 190M embedding parameters, so the full model is comparable in size to XLM-RoBERTa-base.
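Parameter counts for these models are dominated by the embedding table, which is simply vocabulary size times hidden size. The sketch below works through that arithmetic; the vocabulary sizes are rounded approximations (assumed, not exact tokenizer sizes), and the hidden size of 768 is the base-model width.

```python
def embedding_params(vocab_size: int, hidden_size: int) -> int:
    """Number of weights in a token embedding table."""
    return vocab_size * hidden_size


HIDDEN = 768                  # hidden size of the base models
DEBERTA_V3_VOCAB = 128_000    # approximate English vocabulary size (assumed)
MDEBERTA_V3_VOCAB = 250_000   # approximate multilingual vocabulary size (assumed)

english = embedding_params(DEBERTA_V3_VOCAB, HIDDEN)
multilingual = embedding_params(MDEBERTA_V3_VOCAB, HIDDEN)

print(f"English embeddings:      {english / 1e6:.0f}M")
print(f"Multilingual embeddings: {multilingual / 1e6:.0f}M")
```

This is why a multilingual "base" model can have a small Transformer backbone and still be a large download.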

5

mDeBERTa-v3-base-mnli-xnli (Model, 46/100)

via “efficient inference via deberta-v3 architecture with disentangled attention”

zero-shot-classification model. 228,003 downloads.

Unique: Built on mDeBERTa-v3, whose disentangled attention scores content and relative-position interactions as separate terms. This adds a modest amount of attention computation relative to standard multi-head attention, so efficiency gains come mainly from the base-size backbone and from the ONNX and SafeTensors exports, which enable optimized inference across heterogeneous hardware.

vs others: ONNX export allows int8 quantization, which typically shrinks the model and speeds up CPU inference at a small accuracy cost; actual speedups depend on hardware and sequence length. As a multilingual NLI model, it also covers zero-shot classification in languages that English-only classifiers do not.
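The quantization idea referred to above can be pictured with a small sketch of symmetric per-tensor int8 quantization. This is a simplified illustration, not what ONNX Runtime does internally (real tooling also handles zero points, per-channel scales, and activations); the weight values are made up.

```python
from typing import List, Tuple


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Symmetric per-tensor quantization to the int8 range [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Map int8 values back to approximate floats."""
    return [q * scale for q in quantized]


weights = [0.82, -0.41, 0.05, -1.27, 0.63]  # illustrative values
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print(q, scale, max_error)
```

Each weight is stored in one byte instead of four, and the rounding error per weight is bounded by half the scale, which is why accuracy loss is usually small.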

6

DeBERTa-v3-large-mnli-fever-anli-ling-wanli (Model, 46/100)

via “deberta-v3-disentangled-attention-encoding”

zero-shot-classification model. 225,548 downloads.

Unique: DeBERTa-v3's disentangled attention adds content-to-position and position-to-content terms to the usual content-to-content score, giving more expressive representations than standard Transformer attention; combined with relative position embeddings and ELECTRA-style replaced token detection pretraining, the large model reported state-of-the-art GLUE results among models of similar size at release.

vs others: Produces stronger representations than BERT-large or RoBERTa-large; the DeBERTa-v3 model card reports 91.8% on MNLI-matched versus 90.2% for RoBERTa-large, at broadly comparable inference cost.
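The relative position used by disentangled attention is a clipped distance. The sketch below follows the definition of δ(i, j) in the original DeBERTa paper, where `k` is the maximum relative distance; released v2/v3 checkpoints may bucket distances differently, so treat this as the paper's formulation rather than the exact production code.

```python
def relative_position(i: int, j: int, k: int) -> int:
    """Clipped relative distance delta(i, j), mapped into [0, 2k - 1]."""
    diff = i - j
    if diff <= -k:
        return 0            # j is far to the right of i
    if diff >= k:
        return 2 * k - 1    # j is far to the left of i
    return diff + k         # nearby tokens keep their exact offset


k = 4
# Relative positions of every token j in an 8-token sequence, seen from i = 2.
row = [relative_position(2, j, k) for j in range(8)]
print(row)
```

Because the index depends only on the offset, the same position embedding is reused wherever two tokens are the same distance apart.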

7

deberta-v3-base-tasksource-nli (Model, 44/100)

via “deberta-v3 disentangled attention-based text encoding”

zero-shot-classification model. 117,720 downloads.

Unique: Uses DeBERTa-v3's disentangled attention, which factorizes each attention score into content-to-content, content-to-position, and position-to-content terms instead of mixing content and position in a single embedding. This improves accuracy; it does not by itself reduce compute.

vs others: DeBERTa-v3-base scores several points higher than BERT-style base models on classification benchmarks such as MNLI (90.6% versus 87.6% for RoBERTa-base, per the DeBERTa-v3 model card) at similar inference latency. The gain comes from disentangled attention together with replaced token detection pretraining.

8

nli-deberta-v3-base (Model, 44/100)

via “zero-shot natural language inference classification”

zero-shot-classification model. 187,439 downloads.

Unique: Uses a cross-encoder architecture (premise and hypothesis are encoded jointly) rather than a bi-encoder siamese network, so the entailment label is predicted directly instead of being derived from distances in an embedding space. The DeBERTa-v3-base backbone gives higher SNLI/MultiNLI accuracy than BERT-base alternatives of similar size.

vs others: Outperforms BERT-based NLI models (e.g., bert-base-uncased fine-tuned on SNLI) by a few accuracy points, a gain attributable to DeBERTa-v3's architecture and pretraining, and provides faster inference than larger models (RoBERTa-large) while maintaining competitive zero-shot generalization across domains.
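A cross-encoder of this kind emits one logit per NLI class for each premise-hypothesis pair. The sketch below shows the decoding step only: softmax over three logits and a lookup in a label list. The label order and the logit values are assumptions for illustration; the order for a given checkpoint should be read from its config.

```python
import math
from typing import Dict, List

# Assumed label order; check the checkpoint's config before relying on it.
LABELS = ["contradiction", "entailment", "neutral"]


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode(logits: List[float]) -> Dict[str, float]:
    """Map raw class logits to a label -> probability dictionary."""
    return dict(zip(LABELS, softmax(logits)))


# Made-up logits for the pair ("A man is eating pizza", "A man eats something").
probs = decode([-2.1, 3.4, 0.3])
prediction = max(probs, key=probs.get)
print(prediction, probs)
```

For zero-shot classification, only the entailment probability is usually kept and compared across candidate labels.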

9

nli-deberta-v3-small (Model, 44/100)

via “zero-shot natural language inference classification”

zero-shot-classification model. 247,798 downloads.

Unique: Combines DeBERTa-v3-small's disentangled attention (separate content and position representations) with cross-encoder joint encoding, achieving higher NLI accuracy than standard BERT-based classifiers. The small backbone has 6 layers and about 44M parameters, roughly half of the 86M in DeBERTa-v3-base.

vs others: Outperforms bi-encoder zero-shot classifiers (which embed text and labels independently) on NLI-specific tasks due to joint premise-hypothesis encoding, while being much faster than large language models for the same task and requiring no API calls.

10

deberta-xlarge-mnli (Model, 43/100)

text-classification model. 513,435 downloads.

Unique: Uses a disentangled attention mechanism (separate content and relative-position vectors in each transformer layer) instead of standard multi-head attention, together with an enhanced mask decoder, enabling better modeling of long-range dependencies and structural relationships. The xlarge model has about 750M parameters, roughly twice the size of RoBERTa-large.

vs others: Outperforms RoBERTa-large on the MNLI benchmark (91.5% versus 90.2% matched accuracy, per the DeBERTa model card). The larger parameter count makes inference slower than BERT-large or RoBERTa-large, so the accuracy gain comes at a latency cost.

11

DeBERTa-v3-base-mnli-fever-anli (Model, 43/100)

via “multi-dataset natural language inference with cross-domain robustness”

zero-shot-classification model. 64,968 downloads.

Unique: Combines three complementary NLI datasets (MNLI for general inference, FEVER-NLI for fact-checking, ANLI for adversarial robustness) with DeBERTa-v3's disentangled attention to produce a model that generalizes across domains. ANLI was collected adversarially, with annotators writing examples that fooled earlier models, so training on it targets common NLI failure modes.

vs others: More robust to adversarial and out-of-domain examples than single-dataset NLI models (e.g., MNLI-only BERT) thanks to multi-dataset training; smaller and faster than T5-based NLI models while maintaining competitive accuracy on FEVER and ANLI benchmarks.

12

mdeberta-v3-base-squad2 (Model, 42/100)

via “efficient transformer inference with disentangled attention”

question-answering model. 190,899 downloads.

Unique: mDeBERTa-v3 keeps content and relative-position information in separate vectors and combines them as distinct terms within each attention head, rather than adding position embeddings to the input as standard BERT does. This reduces interference between the two signals and improves accuracy; it is not a speed optimization.

vs others: The backbone has about 86M parameters (plus roughly 190M in the multilingual embedding layer), compared with about 340M for BERT-large. Inference cost is in the range of other base-size encoders; speed relative to BERT depends on hardware and sequence length.

13

nli-deberta-v3-large (Model, 42/100)

via “zero-shot natural language inference classification”

zero-shot-classification model. 80,926 downloads.

Unique: Combines DeBERTa-v3-large's disentangled attention (which separates content and position representations) with a cross-encoder architecture that jointly encodes premise-hypothesis pairs, enabling more nuanced detection of semantic relationships than bi-encoder alternatives that embed sentences independently.

vs others: Outperforms BERT-based NLI models and general-purpose zero-shot classifiers on entailment tasks, owing to the DeBERTa-v3 architecture and training on 900K+ NLI examples from SNLI and MultiNLI; faster than ensemble approaches while maintaining competitive accuracy.

14

DeBERTa-v3-xsmall-mnli-fever-anli-ling-binary (Model, 38/100)

via “multilingual natural language inference with english-primary training”

zero-shot-classification model. 33,943 downloads.

Unique: Combines four NLI training datasets (MNLI for general inference, FEVER-NLI for factual claims, ANLI for adversarial robustness, LingNLI for linguistic phenomena) in a single English-language checkpoint built on DeBERTa-v3-xsmall. The binary variant merges neutral and contradiction into a single not-entailment class, which simplifies deployment for entailment-only use cases such as zero-shot classification.

vs others: Far smaller than RoBERTa-large-mnli (about 22M backbone parameters versus roughly 355M), which suits latency-sensitive deployment. Accuracy is lower than that of large models, while multi-dataset training gives better robustness to adversarial examples and factual claims than MNLI-only training.
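A binary NLI variant is typically trained on three-way data whose labels have been collapsed. The sketch below shows that preprocessing step on a few invented examples; the label strings `entailment` and `not_entailment` are assumed names, not confirmed from this checkpoint's config.

```python
from typing import List, Tuple

Example = Tuple[str, str, str]  # (premise, hypothesis, label)


def to_binary(label: str) -> str:
    """Collapse three-way NLI labels into entailment vs. not_entailment."""
    return "entailment" if label == "entailment" else "not_entailment"


# Invented three-way examples for illustration.
three_way: List[Example] = [
    ("A man is sleeping.", "A person is asleep.", "entailment"),
    ("A man is sleeping.", "The man is dreaming of home.", "neutral"),
    ("A man is sleeping.", "A man is running.", "contradiction"),
]

binary = [(p, h, to_binary(y)) for p, h, y in three_way]
labels = [y for _, _, y in binary]
print(labels)
```

For zero-shot classification only the entailment class matters, so dropping the neutral/contradiction distinction loses little for that use case.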

15

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (Model, 21/100)

via “natural language inference with sentence-pair classification”

Unique: Leverages the [CLS] token representation (pre-trained via NSP objective) for sentence-pair classification, creating a direct connection between pre-training and fine-tuning objectives; bidirectional context enables understanding of semantic relationships without explicit alignment or interaction mechanisms

vs others: Achieves +4.6 percentage point improvement on MultiNLI compared to prior baselines by using bidirectional context and joint pre-training (MLM + NSP), whereas prior approaches required task-specific interaction layers or attention mechanisms
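The sentence-pair input format behind this approach can be sketched as follows: the two sentences are packed into one sequence with `[CLS]` and `[SEP]` markers, and segment ids tell the model which sentence each token belongs to. Whitespace splitting here is a simplification standing in for BERT's WordPiece tokenizer.

```python
from typing import List, Tuple


def pack_pair(premise: str, hypothesis: str) -> Tuple[List[str], List[int]]:
    """Build BERT-style tokens and segment ids for a sentence pair."""
    a = premise.lower().split()      # simplified tokenization (assumption)
    b = hypothesis.lower().split()
    tokens = ["[CLS]"] + a + ["[SEP]"] + b + ["[SEP]"]
    # Segment 0 covers [CLS], sentence A, and the first [SEP];
    # segment 1 covers sentence B and the final [SEP].
    segment_ids = [0] * (len(a) + 2) + [1] * (len(b) + 1)
    return tokens, segment_ids


tokens, segment_ids = pack_pair("A dog runs", "An animal moves")
print(tokens)
print(segment_ids)
```

The final hidden state at the `[CLS]` position is what the classification head reads to predict entailment, neutral, or contradiction.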
