make-a-video-pytorch
Implementation of Make-A-Video, new SOTA text to video generator from Meta AI, in Pytorch
Capabilities (12 decomposed)
factorized pseudo-3d convolution with axial decomposition
Medium confidence: Implements efficient pseudo-3D convolutions by factorizing full 3D operations into separate 2D spatial convolutions and 1D temporal convolutions, reducing the per-position kernel cost from O(k³) to O(k² + k) for kernel size k. This PseudoConv3d module enables the model to leverage pre-trained 2D image weights while adding temporal processing, allowing video generation without retraining from scratch on massive video datasets.
Factorizes 3D convolutions into separable 2D+1D components rather than using full 3D kernels, enabling direct weight transfer from 2D image models while maintaining temporal expressiveness through dedicated 1D temporal convolutions
More parameter-efficient than full 3D convolutions (over 50% fewer parameters for a typical 3×3×3 kernel) while maintaining better temporal coherence than naive frame-by-frame processing, enabling practical video generation on consumer hardware
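A minimal sketch of the factorization idea, assuming a straightforward split into a (1, k, k) spatial kernel followed by a (k, 1, 1) temporal kernel; the class name FactorizedConv3d and the layer choices are illustrative, not the repo's exact PseudoConv3d.

```python
import torch
from torch import nn

class FactorizedConv3d(nn.Module):
    """Pseudo-3D convolution sketch: a (1, k, k) spatial kernel followed by a
    (k, 1, 1) temporal kernel instead of one full (k, k, k) kernel."""
    def __init__(self, dim_in, dim_out, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        # spatial conv mixes H and W within each frame
        self.spatial = nn.Conv3d(dim_in, dim_out, (1, kernel_size, kernel_size),
                                 padding=(0, pad, pad))
        # temporal conv mixes information across frames only
        self.temporal = nn.Conv3d(dim_out, dim_out, (kernel_size, 1, 1),
                                  padding=(pad, 0, 0))

    def forward(self, x):            # x: (batch, channels, frames, height, width)
        return self.temporal(self.spatial(x))

video = torch.randn(1, 8, 4, 32, 32)
print(FactorizedConv3d(8, 16)(video).shape)   # torch.Size([1, 16, 4, 32, 32])
```

Stacking the two kernels still yields a k × k × k receptive field per layer, but with roughly k² + k weights per channel pair instead of k³.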
spatiotemporal attention with cross-frame relationships
Medium confidence: Implements a SpatioTemporalAttention module that applies attention across both spatial dimensions (within frames) and temporal dimensions (across frames), capturing long-range dependencies between pixels within individual frames and semantic relationships across video frames. Uses Flash Attention for efficient computation: kernel fusion and block-wise computation avoid materializing the full attention matrix, cutting memory use and memory traffic even though time complexity remains quadratic in sequence length.
Combines spatial and temporal attention in a unified module rather than applying them sequentially, enabling direct modeling of spatiotemporal relationships; integrates Flash Attention for kernel-fused computation reducing memory bandwidth bottlenecks
Notably more memory-efficient than standard multi-head attention when Flash Attention is enabled, while capturing richer temporal dependencies than frame-independent spatial attention, enabling longer coherent video generation
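One possible factorization of the axis handling, sketched below under the assumption that a single module attends over pixels within each frame and then over frames at each pixel position; the single-head class and its name are illustrative, einops is assumed available, and torch.nn.functional.scaled_dot_product_attention stands in for the Flash Attention path.

```python
import torch
import torch.nn.functional as F
from torch import nn
from einops import rearrange

class SpatialThenTemporalAttention(nn.Module):
    """Single-head sketch: attend over pixels within each frame, then over
    frames at each pixel position. Heads, output projections, and positional
    biases of the real SpatioTemporalAttention are omitted."""
    def __init__(self, dim):
        super().__init__()
        self.qkv_spatial = nn.Linear(dim, dim * 3, bias=False)
        self.qkv_temporal = nn.Linear(dim, dim * 3, bias=False)

    @staticmethod
    def _attend(x, to_qkv):
        # add a singleton head dim so the fused attention kernel can be used
        q, k, v = (t.unsqueeze(1) for t in to_qkv(x).chunk(3, dim=-1))
        return F.scaled_dot_product_attention(q, k, v).squeeze(1)

    def forward(self, x):                                # x: (b, c, f, h, w)
        b, c, f, h, w = x.shape
        x = rearrange(x, 'b c f h w -> (b f) (h w) c')   # tokens = pixels per frame
        x = self._attend(x, self.qkv_spatial)
        x = rearrange(x, '(b f) (h w) c -> (b h w) f c', b=b, h=h)  # tokens = frames
        x = self._attend(x, self.qkv_temporal)
        return rearrange(x, '(b h w) f c -> b c f h w', b=b, h=h, w=w)

out = SpatialThenTemporalAttention(32)(torch.randn(2, 32, 4, 8, 8))
print(out.shape)   # torch.Size([2, 32, 4, 8, 8])
```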
configurable temporal processing depth and granularity
Medium confidence: Provides fine-grained control over where and how temporal processing occurs in the network through configuration parameters like enable_time (global on/off), temporal_conv_depth (which layers include temporal convolutions), and attention_temporal_depth (which layers include temporal attention). This enables researchers to experiment with different temporal processing strategies without modifying core architecture code.
Exposes temporal processing configuration at multiple granularity levels (global, per-depth, per-layer) rather than fixed temporal processing patterns, enabling systematic exploration of temporal processing strategies
More flexible than fixed architectures while maintaining cleaner code than fully parameterized designs, enabling practical experimentation without architectural modifications
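A hypothetical configuration sketch of per-depth temporal switches; the field names below (temporal_conv_depths, temporal_attn_depths) are made up for illustration and are not necessarily the library's exact parameters.

```python
from dataclasses import dataclass

@dataclass
class TemporalConfig:
    enable_time: bool = True                                     # global switch
    temporal_conv_depths: tuple = (True, True, False, False)     # per UNet depth
    temporal_attn_depths: tuple = (False, False, True, True)     # per UNet depth

    def use_temporal_conv(self, depth: int) -> bool:
        return self.enable_time and self.temporal_conv_depths[depth]

    def use_temporal_attn(self, depth: int) -> bool:
        return self.enable_time and self.temporal_attn_depths[depth]

cfg = TemporalConfig(enable_time=True)
print([cfg.use_temporal_conv(d) for d in range(4)])   # [True, True, False, False]
```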
gradient checkpointing for memory-efficient training
Medium confidence: Implements gradient checkpointing (activation checkpointing) to reduce memory usage during training by recomputing activations during backward pass instead of storing them. This trades computation for memory, enabling larger batch sizes or longer videos on memory-constrained hardware. Checkpointing can be selectively enabled at different network depths.
Implements selective gradient checkpointing at multiple network depths rather than global checkpointing, enabling fine-tuned memory-computation tradeoffs
More memory-efficient than naive training while maintaining faster convergence than extreme batch size reduction, enabling practical training on consumer hardware
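A small sketch of selective activation checkpointing with torch.utils.checkpoint, assuming checkpointing is toggled per wrapped stage rather than globally; the wrapper class is illustrative.

```python
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint

class CheckpointedStage(nn.Module):
    """Wraps a block so its activations are recomputed during backward instead
    of stored, trading compute for memory. Which stages to wrap is a choice."""
    def __init__(self, block, use_checkpoint=True):
        super().__init__()
        self.block = block
        self.use_checkpoint = use_checkpoint

    def forward(self, x):
        if self.use_checkpoint and self.training:
            return checkpoint(self.block, x, use_reentrant=False)
        return self.block(x)

stage = CheckpointedStage(nn.Sequential(nn.Conv2d(8, 8, 3, padding=1), nn.GELU()))
stage.train()
x = torch.randn(1, 8, 16, 16, requires_grad=True)
stage(x).sum().backward()   # the block's activations are recomputed here
```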
dual-mode image-video processing with dynamic temporal gating
Medium confidence: Implements SpaceTimeUnet architecture that processes both images and videos through the same model by dynamically enabling or disabling temporal processing layers based on input shape and enable_time parameter. When processing images (4D tensors), temporal convolutions and attention are skipped; when processing videos (5D tensors), full spatiotemporal processing is activated. This enables training on image datasets first, then fine-tuning on video data.
Single UNet architecture handles both image and video through runtime shape detection and conditional layer activation, rather than maintaining separate image and video models, enabling seamless transfer learning from image to video domain
More parameter-efficient than maintaining separate image and video models while enabling direct weight transfer from image pre-training, avoiding the need for expensive video-only training from scratch
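A sketch of the dual-mode idea, assuming the dispatch is a simple check on tensor rank plus the enable_time flag mentioned above; the DualModeBlock below is illustrative, not the actual SpaceTimeUnet internals.

```python
import torch
from torch import nn
from einops import rearrange

class DualModeBlock(nn.Module):
    """4D input (b, c, h, w) takes only the spatial path; 5D input
    (b, c, f, h, w) also runs the temporal path when enable_time is True."""
    def __init__(self, dim):
        super().__init__()
        self.spatial = nn.Conv2d(dim, dim, 3, padding=1)
        self.temporal = nn.Conv1d(dim, dim, 3, padding=1)

    def forward(self, x, enable_time=True):
        if x.ndim == 4:                                   # image batch
            return self.spatial(x)
        b, _, _, h, w = x.shape                           # video batch
        x = rearrange(x, 'b c f h w -> (b f) c h w')
        x = self.spatial(x)
        x = rearrange(x, '(b f) c h w -> b c f h w', b=b)
        if enable_time:
            x = rearrange(x, 'b c f h w -> (b h w) c f')
            x = self.temporal(x)
            x = rearrange(x, '(b h w) c f -> b c f h w', b=b, h=h, w=w)
        return x

block = DualModeBlock(8)
print(block(torch.randn(2, 8, 16, 16)).shape)      # torch.Size([2, 8, 16, 16])
print(block(torch.randn(2, 8, 4, 16, 16)).shape)   # torch.Size([2, 8, 4, 16, 16])
```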
hierarchical multi-scale feature processing with skip connections
Medium confidence: Implements standard UNet encoder-bottleneck-decoder architecture with skip connections across multiple resolution levels (typically 4-5 scales), allowing the model to capture both high-level semantic information (in bottleneck) and fine-grained spatial details (through skip connections). Each scale level uses ResnetBlock modules with optional temporal processing, enabling progressive refinement of generated video frames.
Combines standard UNet skip connections with spatiotemporal processing at each scale level, rather than applying temporal processing only at bottleneck, enabling temporal coherence to be maintained across all resolution levels
Better detail preservation than single-scale models while maintaining temporal consistency across scales, compared to naive multi-scale approaches that process spatial and temporal dimensions independently
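A structural sketch of the encoder/skip/decoder pattern, with plain 2D convolutions standing in for the spatiotemporal blocks the real model uses at every scale; dimensions and layer choices are illustrative.

```python
import torch
from torch import nn

class TinyUNetSketch(nn.Module):
    """Encoder/skip/decoder skeleton. Plain 2D convs stand in for the
    spatiotemporal ResnetBlocks used at every scale in the real model."""
    def __init__(self, dim=16, depth=3):
        super().__init__()
        dims = [dim * 2 ** i for i in range(depth + 1)]            # 16, 32, 64, 128
        self.downs = nn.ModuleList(nn.Conv2d(i, o, 3, stride=2, padding=1)
                                   for i, o in zip(dims[:-1], dims[1:]))
        self.mid = nn.Conv2d(dims[-1], dims[-1], 3, padding=1)
        self.ups = nn.ModuleList(nn.ConvTranspose2d(o, i, 4, stride=2, padding=1)
                                 for i, o in zip(dims[:-1], dims[1:]))
        self.merges = nn.ModuleList(nn.Conv2d(i * 2, i, 3, padding=1)
                                    for i in dims[:-1])

    def forward(self, x):
        skips = []
        for down in self.downs:
            skips.append(x)                  # keep features at this resolution
            x = down(x)
        x = self.mid(x)                      # bottleneck
        for up, merge, skip in zip(reversed(self.ups), reversed(self.merges),
                                   reversed(skips)):
            x = up(x)                        # back to the skip's resolution
            x = merge(torch.cat([x, skip], dim=1))   # skip connection restores detail
        return x

net = TinyUNetSketch()
print(net(torch.randn(1, 16, 32, 32)).shape)   # torch.Size([1, 16, 32, 32])
```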
text-to-video generation with diffusion-based denoising
Medium confidence: Implements text-to-video generation by integrating the SpaceTimeUnet with a diffusion process where the model learns to denoise progressively noisier video frames conditioned on text embeddings. The architecture accepts text prompts, encodes them into embeddings (typically via CLIP or similar), and uses these embeddings to guide the denoising process across multiple timesteps, generating coherent videos that match the text description.
Extends diffusion-based image generation to video by incorporating spatiotemporal processing throughout the denoising steps, rather than generating frames independently or using post-hoc temporal smoothing
More temporally coherent than frame-by-frame generation while maintaining the flexibility of diffusion models for diverse output generation, compared to autoregressive models that accumulate errors over long sequences
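A compact sketch of one DDPM-style training step conditioned on text embeddings; the linear beta schedule and the model call signature (noisy video, timestep, text embedding) are assumptions for illustration, not the library's exact training API.

```python
import torch
import torch.nn.functional as F

def diffusion_training_step(model, video, text_embed, num_timesteps=1000):
    """One denoising step: corrupt the clean video at a random timestep and
    train the model to predict the added noise."""
    b = video.shape[0]
    betas = torch.linspace(1e-4, 0.02, num_timesteps)          # linear schedule
    alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

    t = torch.randint(0, num_timesteps, (b,))                  # per-sample timestep
    noise = torch.randn_like(video)
    a = alphas_cumprod[t].view(b, 1, 1, 1, 1)                  # broadcast over c, f, h, w
    noisy = a.sqrt() * video + (1 - a).sqrt() * noise          # q(x_t | x_0)

    pred = model(noisy, t, text_embed)                          # text-conditioned denoiser
    return F.mse_loss(pred, noise)

# Toy call with a stand-in denoiser; real use would pass the text-conditioned UNet.
dummy = lambda x, t, emb: torch.zeros_like(x)
loss = diffusion_training_step(dummy, torch.randn(2, 3, 8, 32, 32),
                               torch.randn(2, 77, 512))
print(loss.item())
```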
efficient temporal convolution with 1d kernels
Medium confidence: Implements 1D temporal convolutions as part of the PseudoConv3d factorization, processing temporal dimension separately from spatial dimensions. These 1D kernels operate along the frame axis, capturing temporal patterns and motion information with minimal computational overhead. The temporal convolutions are applied after spatial convolutions, enabling efficient sequential processing of temporal relationships.
Uses 1D temporal convolutions as part of factorized 3D operations rather than full 3D kernels, enabling direct reuse of 2D image model weights while adding lightweight temporal processing
More efficient than full 3D convolutions (the 1D temporal kernel adds only a small fraction of the parameters a full 3D kernel would require; see the parameter count below) while capturing basic temporal patterns, though less expressive than full 3D convolutions for complex motion
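A quick worked check of the parameter savings, comparing a full 3×3×3 kernel against the factorized 2D + 1D pair at 64 channels; the exact ratio depends on channel counts and kernel sizes.

```python
from torch import nn

count = lambda m: sum(p.numel() for p in m.parameters())

full_3d  = nn.Conv3d(64, 64, kernel_size=3, padding=1)      # one 3x3x3 kernel
spatial  = nn.Conv2d(64, 64, kernel_size=3, padding=1)      # 3x3 spatial part
temporal = nn.Conv1d(64, 64, kernel_size=3, padding=1)      # 3-tap temporal part

print(count(full_3d))                       # 110,656
print(count(spatial) + count(temporal))     # 36,928 + 12,352 = 49,280 (~55% fewer)
```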
resnet block with optional temporal processing
Medium confidence: Implements ResnetBlock modules that form the building blocks of the UNet architecture, featuring residual connections (skip connections within blocks) combined with optional temporal processing layers. Each block applies convolutions, normalization, and activation functions with a residual pathway, enabling deeper networks without vanishing gradients. Temporal processing can be selectively enabled or disabled per block.
Combines ResNet residual pathways with optional temporal processing layers, allowing temporal operations to be selectively enabled at different network depths rather than globally
More flexible than fixed temporal processing patterns while maintaining training stability benefits of residual connections, enabling fine-tuned control over temporal processing distribution
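A minimal residual wrapper in the spirit of the description above; the real ResnetBlock also folds in normalization, activations, and time conditioning, so treat this purely as the skeleton of the residual pathway.

```python
import torch
from torch import nn

class Residual(nn.Module):
    """y = f(x) + x. Wrapping any spatiotemporal block (e.g. the dual-mode block
    sketched earlier) this way keeps deep stacks trainable."""
    def __init__(self, fn):
        super().__init__()
        self.fn = fn

    def forward(self, x, **kwargs):
        return self.fn(x, **kwargs) + x

res = Residual(nn.Conv2d(8, 8, 3, padding=1))
print(res(torch.randn(1, 8, 16, 16)).shape)    # torch.Size([1, 8, 16, 16])
```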
upsampling and downsampling with spatial-temporal awareness
Medium confidence: Implements Upsample and Downsample modules that change spatial resolution while preserving temporal information. Downsampling reduces spatial dimensions (H, W) while keeping frame count constant, enabling multi-scale processing. Upsampling increases spatial dimensions back to original resolution. These operations are designed to work seamlessly with both image (4D) and video (5D) tensors, maintaining temporal coherence during resolution changes.
Implements sampling operations that explicitly preserve temporal dimensions (frame count) while modifying spatial resolution, rather than treating video as 3D volume where all dimensions are sampled uniformly
More efficient than naive 3D sampling (which would reduce frame count) while maintaining temporal information, enabling practical multi-scale video processing
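A sketch of spatial-only resampling, assuming frames are folded into the batch so ordinary 2D interpolation can be used; the function names are illustrative, and the real modules may use learned (strided) convolutions rather than interpolation.

```python
import torch
import torch.nn.functional as F
from einops import rearrange

def spatial_downsample(video, factor=2):
    """Halve H and W while leaving the frame count untouched."""
    b = video.shape[0]                                     # video: (b, c, f, h, w)
    frames = rearrange(video, 'b c f h w -> (b f) c h w')
    frames = F.interpolate(frames, scale_factor=1 / factor, mode='bilinear',
                           align_corners=False)
    return rearrange(frames, '(b f) c h w -> b c f h w', b=b)

def spatial_upsample(video, factor=2):
    b = video.shape[0]
    frames = rearrange(video, 'b c f h w -> (b f) c h w')
    frames = F.interpolate(frames, scale_factor=float(factor), mode='nearest')
    return rearrange(frames, '(b f) c h w -> b c f h w', b=b)

x = torch.randn(1, 8, 6, 32, 32)
print(spatial_downsample(x).shape)                      # (1, 8, 6, 16, 16), frames kept
print(spatial_upsample(spatial_downsample(x)).shape)    # back to (1, 8, 6, 32, 32)
```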
pre-trained image weight initialization and transfer learning
Medium confidence: Enables loading pre-trained 2D image model weights into the video model by mapping 2D convolution weights to the spatial components of PseudoConv3d modules. Temporal convolution kernels are initialized separately (typically with small random values or zero initialization). This approach allows leveraging large-scale image pre-training (ImageNet, LAION) to bootstrap video model training without requiring massive video datasets.
Implements selective weight transfer where only spatial convolution weights are loaded from 2D models while temporal components are initialized separately, enabling asymmetric transfer learning from image to video domain
More effective than random initialization (typically 20-30% faster convergence) while avoiding full retraining, compared to training video models from scratch which requires 10-100x more video data
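A sketch of the asymmetric transfer described above, assuming the 2D weights are copied into the spatial convolution and the temporal convolution starts as an identity (Dirac) kernel so the video model initially reproduces the image model frame by frame; the library's actual initialization may differ.

```python
import torch
from torch import nn

pretrained_2d = nn.Conv2d(64, 64, 3, padding=1)       # stands in for an image-model layer

spatial = nn.Conv2d(64, 64, 3, padding=1)             # spatial part of the factorized conv
temporal = nn.Conv1d(64, 64, 3, padding=1)            # temporal part, initialized separately

with torch.no_grad():
    spatial.weight.copy_(pretrained_2d.weight)        # reuse spatial kernel as-is
    spatial.bias.copy_(pretrained_2d.bias)
    nn.init.dirac_(temporal.weight)                    # center tap = identity mapping
    nn.init.zeros_(temporal.bias)

# Sanity check: with an identity temporal kernel the temporal conv is a no-op.
x = torch.randn(4, 64, 7)                              # (batch*pixels, channels, frames)
assert torch.allclose(temporal(x), x, atol=1e-6)
```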
batch processing with mixed image-video inputs
Medium confidence: Supports processing batches containing both images and videos by padding images to match video frame counts (typically adding dummy frames or repeating frames) and using the enable_time parameter to control temporal processing. The framework handles shape mismatches gracefully, allowing flexible batch composition for training scenarios where image and video data are mixed.
Handles heterogeneous batch composition (images and videos) through shape-aware padding and conditional temporal processing, rather than requiring separate batches for images and videos
More flexible than separate image-video pipelines while maintaining training efficiency, enabling better data utilization when video data is scarce
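A sketch of how mixed batches can share one forward pass, assuming the dual-mode forward(x, enable_time=...) convention sketched earlier; the stand-in model and the lifting strategy are illustrative only.

```python
import torch
from torch import nn

class StandInModel(nn.Module):
    """Identity stand-in with the dual-mode signature; shapes are what matter here."""
    def forward(self, x, enable_time=True):
        return x

model = StandInModel()
images = torch.randn(4, 3, 64, 64)            # plain image batch (b, c, h, w)
videos = torch.randn(2, 3, 8, 64, 64)         # video batch (b, c, f, h, w)

# Option A: images keep their 4D shape and temporal layers stay off
img_out = model(images, enable_time=False)

# Option B: lift images to single-frame videos so both share the 5D code path
lifted_out = model(images.unsqueeze(2), enable_time=False)   # (b, c, 1, h, w)

vid_out = model(videos, enable_time=True)
print(img_out.shape, lifted_out.shape, vid_out.shape)
```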
Capabilities are decomposed by AI analysis. Each maps to specific user intents and improves with match feedback.
Related Artifacts sharing capabilities
Artifacts that share capabilities with make-a-video-pytorch, ranked by overlap. Discovered automatically through the match graph.
MaxViT: Multi-Axis Vision Transformer (MaxViT)
* ⭐ 04/2022: [Hierarchical Text-Conditional Image Generation with CLIP Latents (DALL-E 2)](https://arxiv.org/abs/2204.06125)
oneformer_ade20k_swin_large
image-segmentation model. 102,623 downloads.
Scaling Vision Transformers to 22 Billion Parameters (ViT 22B)
* ⭐ 02/2023: [Adding Conditional Control to Text-to-Image Diffusion Models (ControlNet)](https://arxiv.org/abs/2302.05543)
video-diffusion-pytorch
Implementation of Video Diffusion Models, Jonathan Ho's new paper extending DDPMs to Video Generation - in Pytorch
LTX-Video
Official repository for LTX-Video
Qwen: Qwen3.5-35B-A3B
The Qwen3.5 Series 35B-A3B is a native vision-language model designed with a hybrid architecture that integrates linear attention mechanisms and a sparse mixture-of-experts model, achieving higher inference efficiency. Its overall...
Best For
- ✓ researchers implementing text-to-video models with limited compute budgets
- ✓ teams extending existing diffusion image models to video without massive retraining
- ✓ video generation tasks requiring temporal consistency and smooth transitions
- ✓ applications where frame-to-frame coherence is critical (character animation, scene transitions)
- ✓ researchers optimizing temporal processing strategies
- ✓ production systems requiring inference speed optimization
- ✓ ablation studies investigating temporal processing effectiveness
- ✓ training on consumer GPUs with limited VRAM (8-16GB)
Known Limitations
- ⚠ factorization introduces approximation error compared to true 3D convolutions: spatial and temporal interactions are processed sequentially rather than jointly
- ⚠ cannot capture complex spatiotemporal patterns that require simultaneous spatial-temporal feature mixing
- ⚠ requires careful initialization of temporal convolution kernels to avoid training instability
- ⚠ attention computation scales quadratically with sequence length: processing 24 frames at 512×512 resolution requires ~6GB VRAM even with Flash Attention optimizations
- ⚠ temporal attention requires all frames to be in memory simultaneously, limiting maximum video length to ~30 frames on consumer GPUs
- ⚠ attention patterns are learned during training and may not generalize well to video lengths significantly different from training data
Repository Details
Last commit: May 3, 2024