11-877: Advanced Topics in MultiModal Machine Learning (Fall 2022) - Carnegie Mellon University
Capabilities (11 decomposed)
multimodal-fusion-architecture-instruction
Medium confidence: Teaches architectural patterns for combining visual, audio, and textual modalities through cross-modal attention mechanisms, transformer-based fusion layers, and early/late/hybrid fusion strategies. Covers implementation of joint embedding spaces where heterogeneous data types are projected into shared representational spaces, enabling downstream tasks like visual question answering and video understanding through coordinated feature alignment.
Structured curriculum from Carnegie Mellon's MultiComp Lab combining theoretical foundations with hands-on implementation of state-of-the-art fusion strategies (early fusion via concatenation, late fusion via score aggregation, hybrid attention-based fusion) with explicit coverage of alignment losses and contrastive learning objectives
More comprehensive than generic deep learning courses by focusing exclusively on multimodal-specific architectures and fusion patterns, with direct access to CMU researchers' latest work rather than textbook-only material
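To make the early-vs-late distinction concrete, here is a minimal PyTorch sketch; the module names, feature dimensions, and the random tensors standing in for encoder outputs are illustrative assumptions, not taken from the course materials.

```python
import torch
import torch.nn as nn

class EarlyFusion(nn.Module):
    """Concatenate per-modality features, then classify jointly."""
    def __init__(self, vis_dim, txt_dim, num_classes):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(vis_dim + txt_dim, 256), nn.ReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, vis_feat, txt_feat):
        return self.classifier(torch.cat([vis_feat, txt_feat], dim=-1))

class LateFusion(nn.Module):
    """Score each modality independently, then aggregate the logits."""
    def __init__(self, vis_dim, txt_dim, num_classes):
        super().__init__()
        self.vis_head = nn.Linear(vis_dim, num_classes)
        self.txt_head = nn.Linear(txt_dim, num_classes)

    def forward(self, vis_feat, txt_feat):
        return 0.5 * (self.vis_head(vis_feat) + self.txt_head(txt_feat))

# Toy usage with random features standing in for image/text encoder outputs.
vis, txt = torch.randn(8, 512), torch.randn(8, 300)
print(EarlyFusion(512, 300, 10)(vis, txt).shape)  # torch.Size([8, 10])
print(LateFusion(512, 300, 10)(vis, txt).shape)   # torch.Size([8, 10])
```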
vision-language-model-design-instruction
Medium confidence: Teaches design patterns for vision-language models (VLMs) including CLIP-style contrastive learning, image-text matching objectives, and transformer-based architectures that align visual and textual representations. Covers implementation of dual-encoder systems with shared embedding spaces, training strategies using contrastive losses (InfoNCE), and inference patterns for zero-shot classification and image-text retrieval.
Provides structured breakdown of CLIP-style architectures with explicit coverage of dual-encoder design, contrastive loss formulation (InfoNCE with temperature scaling), and inference-time optimization patterns for efficient similarity computation across large image databases
Deeper technical treatment of vision-language alignment than general multimodal courses, with focus on the mathematical foundations of contrastive objectives and practical implementation details for production-scale systems
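A minimal sketch of the symmetric InfoNCE objective with temperature scaling used in CLIP-style dual encoders, assuming PyTorch; the embeddings are random stand-ins for the outputs of hypothetical image and text encoders, and the temperature value is illustrative.

```python
import torch
import torch.nn.functional as F

def clip_style_infonce(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of matched image-text pairs."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature    # (B, B) similarity matrix
    targets = torch.arange(img_emb.size(0))         # diagonal entries are positives
    loss_i2t = F.cross_entropy(logits, targets)     # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets) # text -> image direction
    return 0.5 * (loss_i2t + loss_t2i)

# Toy batch of 16 aligned embeddings from hypothetical image/text encoders.
loss = clip_style_infonce(torch.randn(16, 512), torch.randn(16, 512))
print(loss.item())
```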
transformer-based-multimodal-architecture-instruction
Medium confidence: Teaches design patterns for transformer-based multimodal models including vision transformers (ViT) for image encoding, text transformers for language understanding, and cross-attention mechanisms that enable interaction between modalities. Covers architectural choices like shared vs separate token spaces, positional encoding strategies for different modalities, and training techniques (masked language modeling, masked image modeling, contrastive learning) adapted for multimodal transformers.
Detailed coverage of transformer-based multimodal architectures including vision transformer (ViT) design with patch embeddings, cross-attention mechanisms for modality interaction, and multimodal pre-training objectives (masked language modeling, masked image modeling, contrastive learning) adapted for transformer-based models
More focused on transformer-specific multimodal design patterns than general multimodal architecture courses, with emphasis on attention mechanisms and pre-training strategies specific to transformer models
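To make the cross-attention pattern concrete, here is a hedged PyTorch sketch of a single block in which text tokens query image patch tokens; the dimensions, token counts, and residual/LayerNorm placement are illustrative choices rather than the course's prescribed architecture.

```python
import torch
import torch.nn as nn

class CrossAttentionBlock(nn.Module):
    """Text tokens (queries) attend over image patch tokens (keys/values)."""
    def __init__(self, dim=256, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_tokens, image_tokens):
        attended, _ = self.attn(query=text_tokens, key=image_tokens, value=image_tokens)
        return self.norm(text_tokens + attended)  # residual connection, then LayerNorm

text = torch.randn(2, 12, 256)      # batch of 12 text tokens
patches = torch.randn(2, 196, 256)  # 14x14 ViT patch embeddings
print(CrossAttentionBlock()(text, patches).shape)  # torch.Size([2, 12, 256])
```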
video-understanding-temporal-modeling-instruction
Medium confidence: Teaches temporal modeling approaches for video understanding including 3D CNNs (C3D), two-stream networks (spatial + temporal pathways), and transformer-based video encoders. Covers how to capture motion patterns through optical flow, frame sampling strategies, and temporal attention mechanisms that learn which frames are semantically important for action recognition and video classification tasks.
Systematic coverage of temporal modeling paradigms including 3D convolutions with learnable temporal kernels, two-stream networks with explicit optical flow computation, and temporal segment networks that sample frames hierarchically to balance computational cost with temporal coverage
More thorough treatment of temporal modeling than general computer vision courses, with explicit comparison of 3D CNN vs two-stream vs transformer approaches and their computational trade-offs
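A small PyTorch sketch of the core building block behind C3D-style encoders: a 3D convolution whose kernel spans time as well as space, so motion and appearance are learned jointly. Channel counts, clip length, and the pooling choice are toy values for illustration.

```python
import torch
import torch.nn as nn

# One 3D convolution block of the kind stacked in C3D-style video encoders.
block = nn.Sequential(
    nn.Conv3d(in_channels=3, out_channels=64, kernel_size=(3, 3, 3), padding=1),
    nn.ReLU(),
    nn.MaxPool3d(kernel_size=(1, 2, 2)),  # pool over space only, keep temporal resolution
)

clip = torch.randn(4, 3, 16, 112, 112)  # (batch, channels, frames, height, width)
print(block(clip).shape)                # torch.Size([4, 64, 16, 56, 56])
```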
audio-visual-synchronization-instruction
Medium confidence: Teaches methods for learning and leveraging audio-visual synchronization, including cross-modal self-supervised learning where audio and video streams are used to supervise each other without labeled data. Covers synchronization detection (determining if audio and video are temporally aligned), audio-visual source separation (isolating individual speakers from mixed audio using visual cues), and learning joint representations through contrastive objectives that maximize agreement between aligned modalities.
Focuses on leveraging natural audio-visual synchronization as a self-supervision signal through contrastive learning (maximizing similarity between aligned audio-video pairs while minimizing similarity to misaligned pairs), with explicit coverage of source separation using visual information to guide audio decomposition
Unique emphasis on audio-visual synchronization as a learning signal rather than treating audio and visual modalities independently, enabling self-supervised pre-training without manual annotations
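One way to turn synchronization into a supervision signal is a binary sync/off-sync classifier. The PyTorch sketch below uses random features as stand-ins for audio and video encoder outputs and batch-rolling as a crude proxy for temporal misalignment; it is an illustrative assumption, not the course's exact recipe.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SyncDetector(nn.Module):
    """Binary classifier: are this audio window and video window temporally aligned?"""
    def __init__(self, aud_dim=128, vid_dim=128):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(aud_dim + vid_dim, 64), nn.ReLU(), nn.Linear(64, 1)
        )

    def forward(self, aud, vid):
        return self.head(torch.cat([aud, vid], dim=-1)).squeeze(-1)

aud, vid = torch.randn(16, 128), torch.randn(16, 128)
det = SyncDetector()
pos_logits = det(aud, vid)                          # aligned pairs -> label 1
neg_logits = det(aud, vid.roll(shifts=1, dims=0))   # shuffled pairing simulates misalignment -> label 0
loss = F.binary_cross_entropy_with_logits(
    torch.cat([pos_logits, neg_logits]),
    torch.cat([torch.ones(16), torch.zeros(16)]),
)
print(loss.item())
```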
cross-modal-retrieval-ranking-instruction
Medium confidence: Teaches methods for building retrieval systems that match queries in one modality (e.g., text) to candidates in another modality (e.g., images) using learned similarity metrics. Covers embedding-based retrieval where both modalities are projected into a shared space, ranking objectives like triplet loss and contrastive losses, and efficient indexing strategies (approximate nearest neighbor search) for scaling to millions of candidates while maintaining sub-second query latency.
Comprehensive treatment of embedding-based retrieval with explicit coverage of ranking objectives (triplet loss, contrastive losses, margin-based losses), efficient indexing via approximate nearest neighbor search (FAISS, LSH), and strategies for handling scale (millions of candidates) while maintaining sub-second latency
More focused on cross-modal retrieval specifics than general information retrieval courses, with emphasis on metric learning for aligning heterogeneous modalities rather than single-modality ranking
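A hedged PyTorch sketch of the two halves of such a system: a triplet ranking objective at training time, and brute-force cosine-similarity ranking at query time. A production system would replace the brute-force step with an approximate-nearest-neighbor index such as FAISS; all embeddings below are random placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Metric learning: pull the matching image toward the text query, push a non-match away.
triplet = nn.TripletMarginLoss(margin=0.2)
anchor_txt = torch.randn(64, 256)   # text query embeddings
pos_img = torch.randn(64, 256)      # matching image embeddings
neg_img = torch.randn(64, 256)      # in-batch non-matching images
loss = triplet(anchor_txt, pos_img, neg_img)

# Retrieval at inference: rank a candidate gallery by cosine similarity, take the top k.
gallery = F.normalize(torch.randn(10_000, 256), dim=-1)  # stand-in for an indexed image set
query = F.normalize(torch.randn(1, 256), dim=-1)
scores = query @ gallery.t()
topk = scores.topk(k=5, dim=-1)
print(loss.item(), topk.indices.tolist())
```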
multimodal-representation-learning-instruction
Medium confidence: Teaches principles of learning joint representations where different modalities are mapped into a shared embedding space that captures semantic relationships. Covers self-supervised learning objectives (contrastive, masked modeling), alignment losses that encourage modality-specific encoders to produce compatible embeddings, and evaluation metrics for measuring the quality of learned representations (downstream task performance, retrieval metrics, linear probe accuracy).
Systematic treatment of multimodal representation learning with explicit coverage of alignment objectives (InfoNCE, triplet loss variants), modality-specific encoder design, and evaluation protocols that measure both representation quality (linear probe accuracy) and downstream task transfer performance
Deeper focus on multimodal-specific representation learning than general self-supervised learning courses, with emphasis on alignment between heterogeneous modalities rather than single-modality contrastive learning
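The linear-probe protocol mentioned above can be sketched in a few lines of PyTorch. The frozen features here are random stand-ins for a pretrained encoder's outputs; a real evaluation would fit the probe on a training split and report accuracy on a held-out split.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def linear_probe(features, labels, num_classes, epochs=100, lr=1e-2):
    """Fit only a linear classifier on frozen features; accuracy proxies representation quality."""
    probe = nn.Linear(features.size(1), num_classes)
    opt = torch.optim.SGD(probe.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = F.cross_entropy(probe(features), labels)
        loss.backward()
        opt.step()
    acc = (probe(features).argmax(dim=-1) == labels).float().mean()
    return acc.item()

# Frozen features from a hypothetical pretrained multimodal encoder (toy data).
feats, labs = torch.randn(500, 512), torch.randint(0, 10, (500,))
print(linear_probe(feats, labs, num_classes=10))
```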
visual-question-answering-instruction
Medium confidence: Teaches architectures and training strategies for visual question answering (VQA) systems that combine visual understanding with natural language reasoning. Covers attention mechanisms that identify relevant image regions for answering questions, fusion of visual features with question embeddings, and training objectives that handle multiple correct answers and answer frequency bias. Includes coverage of VQA datasets (VQA v2, GQA) and evaluation metrics (accuracy, BLEU, CIDEr).
Comprehensive treatment of VQA architectures including spatial attention (identifying relevant image regions), channel attention (weighting feature maps), and fusion strategies for combining visual and textual information, with explicit coverage of handling answer frequency bias through weighted loss functions
More specialized than general vision-language courses by focusing specifically on VQA task design, evaluation protocols, and known dataset biases that affect model performance
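One simple instantiation of bias-aware training is inverse-frequency weighting of the answer classes. The sketch below assumes PyTorch and a toy five-answer vocabulary with made-up counts; it shows the mechanism, not the specific weighting scheme taught in the course.

```python
import torch
import torch.nn.functional as F

# Inverse-frequency class weights so the model is not rewarded for always
# predicting the most common answer (e.g., "yes").
answer_counts = torch.tensor([5000., 4800., 300., 120., 60.])  # toy answer-frequency table
weights = answer_counts.sum() / (len(answer_counts) * answer_counts)

logits = torch.randn(32, 5)             # fused vision+question scores over 5 candidate answers
targets = torch.randint(0, 5, (32,))
loss = F.cross_entropy(logits, targets, weight=weights)
print(weights.tolist(), loss.item())
```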
scene-understanding-semantic-segmentation-instruction
Medium confidence: Teaches methods for dense scene understanding including semantic segmentation (assigning class labels to every pixel), instance segmentation (distinguishing individual objects), and panoptic segmentation (unified segmentation of "stuff" and "things"). Covers encoder-decoder architectures with skip connections, multi-scale feature fusion, and how to leverage multimodal information (RGB-D, RGB-thermal) to improve segmentation accuracy in challenging conditions like low light or occlusion.
Covers dense prediction with explicit treatment of encoder-decoder architectures (FCN, U-Net, DeepLab), multi-scale feature fusion via dilated convolutions and atrous spatial pyramid pooling, and multimodal fusion strategies for RGB-D and RGB-thermal segmentation
More focused on dense prediction tasks than general computer vision courses, with emphasis on leveraging multiple sensor modalities to improve robustness in challenging conditions
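A minimal PyTorch sketch of ASPP-style multi-scale fusion: parallel dilated convolutions with different rates applied to the same feature map, then concatenated and projected. Channel counts and dilation rates are illustrative, not DeepLab's exact configuration.

```python
import torch
import torch.nn as nn

class MiniASPP(nn.Module):
    """Parallel dilated convolutions capture context at several receptive-field sizes,
    then the branches are concatenated and projected back to the target width."""
    def __init__(self, in_ch=256, out_ch=256, rates=(1, 6, 12)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=r, dilation=r) for r in rates
        ])
        self.project = nn.Conv2d(out_ch * len(rates), out_ch, kernel_size=1)

    def forward(self, x):
        return self.project(torch.cat([b(x) for b in self.branches], dim=1))

feat = torch.randn(2, 256, 33, 33)  # stand-in backbone feature map
print(MiniASPP()(feat).shape)       # torch.Size([2, 256, 33, 33]) -- spatial size preserved
```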
multimodal-dataset-construction-annotation-instruction
Medium confidence: Teaches best practices for constructing and annotating multimodal datasets including data collection strategies, quality control mechanisms, inter-annotator agreement measurement, and handling of annotation disagreement. Covers practical considerations like managing multiple modalities with different temporal alignments, privacy-preserving data collection, and creating balanced datasets that avoid spurious correlations between modalities that models can exploit without learning robust representations.
Addresses multimodal-specific challenges in dataset construction including temporal synchronization across modalities, detection of spurious correlations that models can exploit, and annotation protocols that account for modality-specific ambiguities (e.g., visual ambiguity vs linguistic ambiguity)
More specialized than general data annotation guidance by addressing multimodal-specific challenges like temporal alignment, modality-specific shortcuts, and inter-modality consistency
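Inter-annotator agreement is commonly reported with a chance-corrected statistic such as Cohen's kappa. A small NumPy sketch follows; the labels and the three-class annotation scheme are hypothetical.

```python
import numpy as np

def cohens_kappa(labels_a, labels_b, num_classes):
    """Chance-corrected agreement between two annotators labeling the same items."""
    labels_a, labels_b = np.asarray(labels_a), np.asarray(labels_b)
    observed = (labels_a == labels_b).mean()
    # Chance agreement from each annotator's marginal label distribution.
    p_a = np.bincount(labels_a, minlength=num_classes) / len(labels_a)
    p_b = np.bincount(labels_b, minlength=num_classes) / len(labels_b)
    expected = (p_a * p_b).sum()
    return (observed - expected) / (1.0 - expected)

# Toy sentiment labels from two annotators for the same 8 video clips.
print(cohens_kappa([0, 1, 2, 1, 0, 2, 1, 1], [0, 1, 2, 0, 0, 2, 1, 2], num_classes=3))
```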
multimodal-model-evaluation-benchmarking-instruction
Medium confidence: Teaches evaluation methodologies for multimodal models including task-specific metrics (accuracy, F1, BLEU, CIDEr for different modalities), robustness evaluation under distribution shift, and analysis of what each modality contributes to predictions. Covers ablation studies that measure modality importance, adversarial robustness testing, and creation of diagnostic datasets that isolate specific capabilities (e.g., compositional reasoning, counting, spatial relationships).
Comprehensive treatment of multimodal evaluation including modality-specific metrics, ablation studies that isolate modality contributions, diagnostic datasets for testing specific capabilities (compositional reasoning, counting), and robustness evaluation under modality-specific perturbations
More specialized than general model evaluation guidance by addressing multimodal-specific challenges like measuring modality contributions, evaluating robustness to modality-specific distribution shift, and creating diagnostic tests for multimodal reasoning
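Modality-contribution ablations can be as simple as re-running evaluation with one modality's features zeroed out. The PyTorch sketch below uses an untrained stand-in classifier and random features purely to show the shape of such an experiment; names, dimensions, and data are all toy assumptions.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512 + 300, 10))  # stand-in for a trained fusion classifier

def accuracy(vis, txt, labels):
    """Accuracy of the fused classifier on the given (possibly ablated) features."""
    with torch.no_grad():
        logits = model(torch.cat([vis, txt], dim=-1))
        return (logits.argmax(dim=-1) == labels).float().mean().item()

vis, txt = torch.randn(200, 512), torch.randn(200, 300)
labels = torch.randint(0, 10, (200,))
report = {
    "full": accuracy(vis, txt, labels),
    "vision_ablated": accuracy(torch.zeros_like(vis), txt, labels),
    "text_ablated": accuracy(vis, torch.zeros_like(txt), labels),
}
print(report)  # the gap between "full" and each ablation estimates that modality's contribution
```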
Capabilities are decomposed by AI analysis. Each maps to specific user intents and improves with match feedback.
Related Artifacts (sharing capabilities)
Artifacts that share capabilities with 11-877: Advanced Topics in MultiModal Machine Learning (Fall 2022) - Carnegie Mellon University, ranked by overlap. Discovered automatically through the match graph.
Tutorial on MultiModal Machine Learning (ICML 2023) - Carnegie Mellon University

11-777: MultiModal Machine Learning (Fall 2022) - Carnegie Mellon University

CSCI-GA.3033-102 Special Topic - Learning with Large Language and Vision Models
Qwen: Qwen3 VL 32B Instruct
Qwen3-VL-32B-Instruct is a large-scale multimodal vision-language model designed for high-precision understanding and reasoning across text, images, and video. With 32 billion parameters, it combines deep visual perception with advanced text...
awesome-generative-ai-guide
A one stop repository for generative AI research updates, interview resources, notebooks and much more!
CS25: Transformers United V2 - Stanford University

Best For
- ✓ML researchers and engineers building vision-language models
- ✓Teams developing multimodal recommendation or retrieval systems
- ✓PhD students specializing in multimodal deep learning
- ✓ML engineers building search and retrieval systems
- ✓Researchers developing foundation models with multimodal capabilities
- ✓Teams implementing zero-shot vision applications
- ✓ML engineers building state-of-the-art multimodal models
- ✓Researchers developing foundation models with transformer architectures
Known Limitations
- ⚠Course material is from Fall 2022 and does not cover post-2023 advances such as newer vision transformer (ViT) variants or diffusion-based multimodal models
- ⚠Assumes strong foundational knowledge of deep learning, CNNs, and transformers — not suitable for beginners
- ⚠Focuses on academic research patterns rather than production deployment considerations like model compression or inference optimization
- ⚠Does not cover recent scaling techniques like vision-language pre-training on billions of image-text pairs (as in ALIGN, LiT)
- ⚠Limited coverage of instruction-tuning and fine-tuning strategies for downstream tasks
- ⚠Assumes access to large-scale datasets — practical guidance for smaller-scale training is minimal
UnfragileRank
UnfragileRank is computed from adoption signals, documentation quality, ecosystem connectivity, match graph feedback, and freshness. No artifact can pay for a higher rank.