{"passport":{"unfragile":{"@version":"1.0","version":"2026-05","artifact":{"id":"awesome-scalable-diffusion-models-with-transformers-dit","slug":"scalable-diffusion-models-with-transformers-dit","name":"Scalable Diffusion Models with Transformers (DiT)","type":"product","url":"https://arxiv.org/abs/2212.09748","page_url":"https://unfragile.ai/scalable-diffusion-models-with-transformers-dit","categories":["productivity"],"tags":[],"pricing":{"model":"unknown","free":false,"starting_price":null},"status":"inactive","verified":false},"capabilities":[{"id":"awesome-scalable-diffusion-models-with-transformers-dit__cap_0","uri":"capability://image.visual.transformer.based.diffusion.image.generation.with.scalable.architecture","name":"transformer-based diffusion image generation with scalable architecture","description":"Replaces convolutional U-Net backbones in diffusion models with pure transformer architectures (DiT blocks), enabling linear scaling with model capacity and improved computational efficiency. 
Uses standard transformer layers with adaptive layer normalization (AdaLN) to inject diffusion timestep and class conditioning directly into attention mechanisms, eliminating separate conditioning pathways and reducing architectural complexity.","intents":["Scale image generation models to billions of parameters while maintaining training efficiency","Reduce memory footprint and latency compared to CNN-based diffusion architectures","Leverage existing transformer optimization infrastructure (flash attention, distributed training frameworks) for diffusion models","Generate high-resolution images with improved quality-to-parameter-count ratios"],"best_for":["ML researchers building large-scale generative models","Teams deploying image generation at scale with compute constraints","Organizations wanting to unify transformer infrastructure across NLP and vision tasks"],"limitations":["Requires substantial compute for training (reported experiments use 256-2048 GPUs); not practical for resource-constrained environments","Inference latency depends on sequence length of flattened image patches; high-resolution generation (1024x1024+) becomes expensive","Transformer attention is O(n²) in sequence length; image patch tokenization overhead increases with resolution","Requires careful tuning of patch embedding size and model depth; no universal hyperparameter recipe across resolutions"],"requires":["PyTorch 1.13+ with CUDA 11.8+ for efficient training","Distributed training framework (PyTorch DDP, DeepSpeed, or Megatron-LM) for multi-GPU scaling","Image dataset with 1M+ samples for meaningful convergence","GPU cluster with 40GB+ VRAM per device (A100/H100 recommended)"],"input_types":["image tensors (224x224 to 1024x1024 resolution)","class labels or text embeddings for conditioning","diffusion timestep indices (0-1000 range typical)"],"output_types":["image tensors (same resolution as input)","latent representations for downstream 
tasks"],"categories":["image-visual","automation-workflow"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"awesome-scalable-diffusion-models-with-transformers-dit__cap_1","uri":"capability://image.visual.adaptive.layer.normalization.for.timestep.and.class.conditioning","name":"adaptive layer normalization for timestep and class conditioning","description":"Injects diffusion timestep and class information directly into transformer blocks via learned affine transformations (scale and shift) applied to layer normalization outputs, eliminating the need for separate conditioning networks or concatenation-based feature fusion. Each transformer block learns independent AdaLN parameters conditioned on timestep embeddings and optional class embeddings, enabling efficient multi-modal conditioning without architectural branching.","intents":["Condition diffusion generation on timestep and class labels without increasing model complexity","Reduce parameter overhead compared to concatenation-based conditioning in transformer blocks","Enable flexible conditioning on multiple modalities (time, class, text) with minimal architectural changes","Improve gradient flow during training by avoiding feature concatenation bottlenecks"],"best_for":["Researchers implementing conditional diffusion models with transformers","Teams needing efficient multi-modal conditioning without separate encoder networks","Projects requiring minimal overhead for adding new conditioning signals"],"limitations":["AdaLN parameters are learned per block; adding new conditioning modalities requires retraining or fine-tuning","Timestep embeddings must be pre-computed and passed through the model; no dynamic timestep adaptation during inference","Class conditioning assumes discrete labels; continuous conditioning signals require additional embedding layers","Scaling to 100+ conditioning dimensions may require careful initialization to avoid training instability"],"requires":["PyTorch 1.9+ for efficient 
layer normalization implementations","Timestep embedding module (sinusoidal or learned embeddings, typically 256-512 dims)","Class embedding table if using class conditioning (size = num_classes × embedding_dim)"],"input_types":["transformer block hidden states (batch_size, seq_len, hidden_dim)","timestep embeddings (batch_size, embedding_dim)","class embeddings (batch_size, embedding_dim) or None"],"output_types":["conditioned hidden states (batch_size, seq_len, hidden_dim)","scale and shift parameters for each block"],"categories":["image-visual","data-processing-analysis"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"awesome-scalable-diffusion-models-with-transformers-dit__cap_10","uri":"capability://data.processing.analysis.model.scaling.laws.and.parameter.efficiency.analysis","name":"model scaling laws and parameter efficiency analysis","description":"Analyzes how generation quality (FID/IS) scales with model size (parameters), training compute, and data, demonstrating that transformer-based diffusion models follow predictable scaling laws similar to language models. 
Enables principled decisions about model size, training duration, and data requirements by fitting power-law relationships between compute and quality metrics.","intents":["Predict generation quality for different model sizes without training all variants","Optimize model size and training compute for target quality levels","Compare parameter efficiency of different architectures (transformers vs CNNs)","Plan training budgets and resource allocation based on quality targets"],"best_for":["ML researchers studying generative model scaling","Teams planning large-scale model training with compute constraints","Organizations optimizing model size vs quality tradeoffs"],"limitations":["Scaling laws are empirical; extrapolation beyond observed range is unreliable","Scaling laws depend on training data quality and diversity; different datasets may have different exponents","Assumes fixed architecture and training procedure; changing these invalidates scaling law predictions","Requires training multiple model sizes (5-10 variants) to fit reliable scaling laws; expensive in compute"],"requires":["Multiple trained models of different sizes (100M, 300M, 1B, 3B parameters typical)","FID/IS metrics for each model size","Compute budget tracking (GPU-hours or FLOPs) for each training run"],"input_types":["model sizes (parameters)","training compute (GPU-hours or FLOPs)","FID/IS metrics for each model"],"output_types":["scaling law coefficients (power-law exponents)","predicted quality for arbitrary model sizes","optimal model size for target quality"],"categories":["data-processing-analysis","planning-reasoning"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"awesome-scalable-diffusion-models-with-transformers-dit__cap_2","uri":"capability://image.visual.patch.based.image.tokenization.for.transformer.input","name":"patch-based image tokenization for transformer input","description":"Converts images into sequences of flattened patch embeddings by dividing images into 
non-overlapping patches (e.g., 16x16 pixels), projecting each patch to a fixed embedding dimension via a linear layer, and flattening the spatial grid into a sequence. This enables transformer processing of images by converting 2D spatial data into 1D sequences compatible with standard attention mechanisms, with patch size as a tunable hyperparameter controlling sequence length and receptive field.","intents":["Convert 2D images into 1D sequences suitable for transformer processing","Control computational cost of attention by tuning patch size (larger patches = shorter sequences = faster attention)","Maintain spatial structure information through patch position embeddings","Enable resolution-agnostic models by using relative position embeddings"],"best_for":["Researchers implementing vision transformers for image generation or analysis","Teams needing to balance image resolution against computational budget","Projects requiring flexible resolution handling without retraining"],"limitations":["Patch size is fixed at training time; changing patch size requires retraining or interpolating position embeddings","Information loss at patch boundaries; fine details smaller than patch size may be lost","Sequence length grows quadratically with image resolution (e.g., 1024x1024 image with 16x16 patches = 4096 tokens); attention cost becomes prohibitive for very high resolutions","Patch embeddings are learned; initialization and training dynamics differ from pixel-level processing"],"requires":["PyTorch 1.9+ with efficient linear layer implementations","Image resolution divisible by patch size (e.g., 224x224 with 16x16 patches = 14x14 grid)","Position embedding table (size = (max_patches + 1) × embedding_dim if using learnable embeddings)"],"input_types":["images (batch_size, 3, height, width) with height and width divisible by patch_size","patch_size parameter (typically 8, 16, or 32)"],"output_types":["patch embeddings (batch_size, num_patches, embedding_dim)","position 
embeddings (num_patches, embedding_dim)"],"categories":["image-visual","data-processing-analysis"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"awesome-scalable-diffusion-models-with-transformers-dit__cap_3","uri":"capability://image.visual.diffusion.timestep.embedding.and.scheduling","name":"diffusion timestep embedding and scheduling","description":"Encodes diffusion timestep indices (0 to T-1) into continuous embeddings using sinusoidal positional encoding (similar to transformer position embeddings) or learned embeddings, then passes these embeddings through an MLP to produce conditioning vectors injected into each transformer block. Supports standard noise schedules (linear, cosine, quadratic) that define the variance schedule σ(t) used during training and inference, enabling flexible control over the diffusion process dynamics.","intents":["Encode discrete timestep indices into continuous representations for transformer conditioning","Control noise schedule during training and inference to balance quality and speed","Enable flexible timestep sampling strategies (uniform, importance-weighted) during training","Support variable-length diffusion processes (e.g., 50-step vs 1000-step inference)"],"best_for":["Researchers implementing diffusion models with transformers","Teams tuning noise schedules for specific image quality or inference speed requirements","Projects requiring custom timestep sampling strategies"],"limitations":["Timestep embeddings are learned during training; changing the noise schedule requires retraining","Sinusoidal embeddings assume fixed maximum timestep T; extrapolating beyond training T is not well-studied","Different noise schedules (linear vs cosine) produce different quality-speed tradeoffs; no universal optimal schedule","Importance-weighted timestep sampling requires computing loss weights for each timestep, adding training overhead"],"requires":["PyTorch 1.9+ for efficient embedding operations","Noise schedule definition 
(e.g., cosine schedule with T=1000 steps)","MLP for timestep embedding projection (typically 2-3 layers, 256-512 hidden dims)"],"input_types":["timestep indices (batch_size,) with values in range [0, T-1]","noise schedule parameters (e.g., β_start, β_end for linear schedule)"],"output_types":["timestep embeddings (batch_size, embedding_dim)","noise levels σ(t) for each timestep"],"categories":["image-visual","data-processing-analysis"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"awesome-scalable-diffusion-models-with-transformers-dit__cap_4","uri":"capability://automation.workflow.multi.gpu.distributed.training.with.gradient.checkpointing","name":"multi-gpu distributed training with gradient checkpointing","description":"Implements distributed training across multiple GPUs using PyTorch DDP or DeepSpeed, with gradient checkpointing to reduce memory usage by recomputing activations during backpropagation rather than storing them. Enables training of large DiT models (1B+ parameters) by distributing batch processing across GPUs and using activation checkpointing to trade compute for memory, critical for fitting models on 40GB+ VRAM devices.","intents":["Train billion-parameter diffusion models on multi-GPU clusters","Reduce per-GPU memory footprint to fit larger models on available hardware","Achieve near-linear scaling of training throughput with number of GPUs","Enable efficient fine-tuning of pre-trained models on limited hardware"],"best_for":["ML teams with access to multi-GPU clusters (8+ GPUs)","Researchers training large-scale generative models","Organizations deploying diffusion models at scale"],"limitations":["Gradient checkpointing adds ~20-30% training time overhead due to recomputation; tradeoff between memory and speed","Distributed training introduces synchronization overhead; effective batch size must be large (256+) to amortize communication costs","Requires careful tuning of learning rate and warmup schedule for multi-GPU training; naive 
scaling often leads to divergence","Communication bandwidth between GPUs becomes bottleneck for very large models; requires high-speed interconnect (NVLink, InfiniBand)"],"requires":["PyTorch 1.13+ with NCCL backend for GPU communication","Multiple GPUs with 40GB+ VRAM each (A100/H100 recommended)","High-speed GPU interconnect (NVLink or InfiniBand) for efficient all-reduce operations","DeepSpeed or Megatron-LM for advanced distributed training features"],"input_types":["image batches (batch_size, 3, height, width)","class labels or conditioning embeddings","distributed training configuration (num_gpus, batch_size_per_gpu)"],"output_types":["trained model weights (distributed across GPUs)","training logs and checkpoints"],"categories":["automation-workflow","data-processing-analysis"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"awesome-scalable-diffusion-models-with-transformers-dit__cap_5","uri":"capability://image.visual.class.conditional.image.generation.with.learned.embeddings","name":"class-conditional image generation with learned embeddings","description":"Supports class-conditional generation by learning a class embedding table (num_classes × embedding_dim) that maps discrete class labels to continuous embeddings, which are then injected into transformer blocks via AdaLN. 
Enables controlled generation of specific object classes or categories by conditioning the diffusion process on class embeddings, with optional dropout of class embeddings during training for unconditional generation.","intents":["Generate images of specific classes (e.g., 'dog', 'car') by conditioning on class labels","Support classifier-free guidance by randomly dropping class conditioning during training","Enable fine-grained control over generated image content via discrete class labels","Train single models that can generate multiple object categories"],"best_for":["Researchers building class-conditional generative models","Teams needing controlled image generation for specific categories","Projects using classifier-free guidance for improved generation quality"],"limitations":["Class conditioning assumes discrete labels; continuous attributes require separate conditioning mechanisms","Classifier-free guidance requires training with random class dropout; adds training complexity and computational overhead","Class embeddings are learned; new classes require retraining or fine-tuning","Guidance scale (λ in classifier-free guidance) is a hyperparameter requiring tuning; no universal optimal value"],"requires":["Class label dataset (ImageNet-style with discrete class labels)","Class embedding table (num_classes × embedding_dim, typically 256-512 dims)","Dropout probability for classifier-free guidance (typically 0.1-0.2)"],"input_types":["class labels (batch_size,) with values in range [0, num_classes-1]","class embeddings (batch_size, embedding_dim)"],"output_types":["class-conditioned images (batch_size, 3, height, width)","guidance scale parameter for inference"],"categories":["image-visual","data-processing-analysis"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"awesome-scalable-diffusion-models-with-transformers-dit__cap_6","uri":"capability://image.visual.inference.time.guidance.scaling.for.quality.diversity.tradeoff","name":"inference-time guidance 
scaling for quality-diversity tradeoff","description":"Implements classifier-free guidance at inference time by computing predictions for both conditioned and unconditional diffusion paths, then blending them with a guidance scale parameter λ: x̂ = x̂_uncond + λ(x̂_cond - x̂_uncond). This enables post-hoc control over generation quality and diversity without retraining, trading inference speed (2x forward passes) for improved sample quality and stronger adherence to conditioning signals.","intents":["Improve generation quality at inference time without retraining","Control the tradeoff between diversity and adherence to class/text conditioning","Enable interactive generation with real-time quality adjustment via guidance scale","Boost sample quality for specific use cases (e.g., high-quality product images) without model changes"],"best_for":["Practitioners deploying diffusion models in production","Teams needing flexible quality-diversity tradeoffs without retraining","Applications requiring interactive generation with real-time parameter tuning"],"limitations":["Guidance scaling requires 2x forward passes (conditioned + unconditional); doubles inference latency","Guidance scale λ is a hyperparameter with no universal optimal value; requires empirical tuning per use case","Very high guidance scales (λ > 15) can produce artifacts or unrealistic images; requires careful tuning","Requires models trained with classifier-free guidance (random conditioning dropout); not compatible with models trained without dropout"],"requires":["Pre-trained diffusion model trained with classifier-free guidance (conditioning dropout > 0)","Guidance scale parameter λ (typically 1.0-15.0, with 7.5 as common default)","Inference scheduler (DDPM, DDIM, or other sampler)"],"input_types":["conditioning signal (class label, text embedding, or None)","guidance scale λ (float, typically 1.0-15.0)","noise schedule for inference (e.g., 50-step DDIM)"],"output_types":["guided images (batch_size, 
3, height, width)","quality metrics (FID, IS) for different guidance scales"],"categories":["image-visual","automation-workflow"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"awesome-scalable-diffusion-models-with-transformers-dit__cap_7","uri":"capability://image.visual.efficient.inference.with.ddim.sampling.and.step.reduction","name":"efficient inference with ddim sampling and step reduction","description":"Implements DDIM (Denoising Diffusion Implicit Models) sampling to reduce inference steps from 1000 (DDPM) to 50-100 steps with minimal quality loss, using a deterministic sampling procedure that skips timesteps while maintaining the diffusion trajectory. Enables fast inference by trading off some quality for speed, with configurable step counts allowing users to balance latency against sample fidelity.","intents":["Reduce inference latency from minutes (1000 DDPM steps) to seconds (50-100 DDIM steps)","Enable real-time or near-real-time image generation for interactive applications","Support variable-speed inference by tuning step count without retraining","Deploy diffusion models in latency-sensitive applications (e.g., web services, mobile)"],"best_for":["Practitioners deploying diffusion models in production with latency constraints","Teams building interactive generation applications","Applications requiring real-time or near-real-time generation"],"limitations":["DDIM sampling is deterministic; reduces diversity compared to stochastic DDPM sampling","Very aggressive step reduction (< 20 steps) produces noticeable quality degradation; requires empirical tuning per model","DDIM does not assume a particular noise schedule, but the timestep subsequence used for step skipping may need tuning for non-linear schedules","Inference speed is still O(num_steps × model_forward_pass); fundamental latency bottleneck remains"],"requires":["Pre-trained diffusion model (compatible with DDIM sampling)","DDIM sampler implementation (typically 50-100 lines of code)","Noise schedule definition (linear or 
cosine)"],"input_types":["initial noise (batch_size, 3, height, width)","conditioning signal (class label, text embedding, or None)","num_steps parameter (typically 20-100)"],"output_types":["generated images (batch_size, 3, height, width)","inference latency metrics"],"categories":["image-visual","automation-workflow"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"awesome-scalable-diffusion-models-with-transformers-dit__cap_8","uri":"capability://image.visual.resolution.agnostic.generation.via.relative.position.embeddings","name":"resolution-agnostic generation via relative position embeddings","description":"Uses relative position embeddings instead of absolute position embeddings in transformer blocks, enabling the model to generalize to image resolutions not seen during training. Relative embeddings encode the distance between patches rather than absolute positions, allowing the same model to generate images at 256x256, 512x512, or 1024x1024 without retraining or position embedding interpolation.","intents":["Generate images at multiple resolutions without retraining or fine-tuning","Enable flexible resolution selection at inference time based on application requirements","Reduce training cost by training on single resolution and generalizing to others","Support variable-resolution batch processing in production systems"],"best_for":["Researchers building flexible generative models","Teams needing multi-resolution generation from single model","Applications with variable resolution requirements"],"limitations":["Relative position embeddings add complexity to attention computation; ~5-10% inference overhead vs absolute embeddings","Generalization to very different resolutions (e.g., 256x256 training → 2048x2048 inference) may degrade quality","Requires careful implementation of relative position bias computation; standard PyTorch transformers don't support this out-of-the-box","Patch size is still fixed at training time; changing patch size requires 
retraining"],"requires":["Custom relative position embedding implementation (typically 50-100 lines of code)","Transformer blocks with relative position bias support","Training on representative resolution (e.g., 512x512) for good generalization"],"input_types":["images at variable resolutions (height, width divisible by patch_size)","relative position bias parameters"],"output_types":["generated images at requested resolution","quality metrics for different resolutions"],"categories":["image-visual","data-processing-analysis"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"awesome-scalable-diffusion-models-with-transformers-dit__cap_9","uri":"capability://data.processing.analysis.fid.and.inception.score.evaluation.metrics.for.generation.quality","name":"fid and inception score evaluation metrics for generation quality","description":"Computes Fréchet Inception Distance (FID) and Inception Score (IS) metrics to quantitatively evaluate image generation quality by comparing generated images to real images using features from a pre-trained Inception network. FID measures the distance between feature distributions of real and generated images; IS measures the quality and diversity of generated images independently. 
Enables systematic comparison of model variants and hyperparameter choices.","intents":["Quantitatively evaluate generation quality during training and model selection","Compare different model architectures, hyperparameters, or training strategies","Track generation quality improvements over training iterations","Benchmark against published results and competing methods"],"best_for":["Researchers developing and comparing generative models","Teams tuning hyperparameters and architecture choices","Projects requiring quantitative quality metrics for model selection"],"limitations":["FID and IS are proxy metrics; a low FID or a high IS doesn't guarantee perceptual quality or usefulness for downstream tasks","Both metrics require large sample sizes (10k+ images) for stable estimates; small sample FID is noisy","Inception network is trained on ImageNet; metrics may not reflect quality for out-of-distribution domains (medical images, artistic styles)","FID computation requires storing all real image features; memory-intensive for large datasets"],"requires":["Pre-trained Inception-v3 network (typically downloaded from torchvision)","Real image dataset (10k+ images) for computing reference statistics","Generated image samples (10k+ images) for evaluation","GPU for efficient feature extraction"],"input_types":["real images (batch_size, 3, 299, 299) resized to Inception input size","generated images (batch_size, 3, 299, 299)","pre-computed Inception features (optional, for efficiency)"],"output_types":["FID score (float, lower is better, typical range 0-100)","Inception Score (float, higher is better, typical range 1-100)","feature statistics (mean, covariance) for real and generated images"],"categories":["data-processing-analysis","automation-workflow"],"confidence":0.5,"matches":0,"success_rate":0}],"trust":{"score":21,"verified":false,"data_access_risk":"low","permissions":["PyTorch 1.13+ with CUDA 11.8+ for efficient training","Distributed training framework (PyTorch DDP, 
DeepSpeed, or Megatron-LM) for multi-GPU scaling","Image dataset with 1M+ samples for meaningful convergence","GPU cluster with 40GB+ VRAM per device (A100/H100 recommended)","PyTorch 1.9+ for efficient layer normalization implementations","Timestep embedding module (sinusoidal or learned embeddings, typically 256-512 dims)","Class embedding table if using class conditioning (size = num_classes × embedding_dim)","Multiple trained models of different sizes (100M, 300M, 1B, 3B parameters typical)","FID/IS metrics for each model size","Compute budget tracking (GPU-hours or FLOPs) for each training run"],"failure_modes":["Requires substantial compute for training (reported experiments use 256-2048 GPUs); not practical for resource-constrained environments","Inference latency depends on sequence length of flattened image patches; high-resolution generation (1024x1024+) becomes expensive","Transformer attention is O(n²) in sequence length; image patch tokenization overhead increases with resolution","Requires careful tuning of patch embedding size and model depth; no universal hyperparameter recipe across resolutions","AdaLN parameters are learned per block; adding new conditioning modalities requires retraining or fine-tuning","Timestep embeddings must be pre-computed and passed through the model; no dynamic timestep adaptation during inference","Class conditioning assumes discrete labels; continuous conditioning signals require additional embedding layers","Scaling to 100+ conditioning dimensions may require careful initialization to avoid training instability","Scaling laws are empirical; extrapolation beyond observed range is unreliable","Scaling laws depend on training data quality and diversity; different datasets may have different exponents","builder identity is not verified yet","no observed match outcomes 
yet"],"rank_breakdown":{"adoption":0.05,"quality":0.22,"ecosystem":0.25,"match_graph":0.25,"freshness":0.5,"weights":{"adoption":0.25,"quality":0.25,"ecosystem":0.1,"match_graph":0.35,"freshness":0.05}},"observed_outcomes":{"matches":0,"success_rate":0,"avg_confidence":0,"top_intents":[],"last_matched_at":null},"maintenance":{"status":"inactive","updated_at":"2026-06-17T09:51:04.048Z","last_scraped_at":"2026-05-03T14:00:27.894Z","last_commit":null},"community":{"stars":null,"forks":null,"weekly_downloads":null,"model_downloads":null,"model_likes":null}},"distribution":{"claim_url":"https://unfragile.ai/submit?claim=scalable-diffusion-models-with-transformers-dit","compare_url":"https://unfragile.ai/compare?artifact=scalable-diffusion-models-with-transformers-dit"}},"signature":"c6D3ceCsgCd5aRMg+qJMKoK8Ornrsfejm8c3Dr/efQHz/3DQ7vOTJKDHo2D0n/hanIOE2vTYtw/iBVIIHs/cDw==","signedAt":"2026-06-20T18:38:09.599Z","signedBy":"unfragile.ai","version":1},"_links":{"self":"https://unfragile.ai/api/v1/passport/scalable-diffusion-models-with-transformers-dit","artifact":"https://unfragile.ai/scalable-diffusion-models-with-transformers-dit","verify":"https://unfragile.ai/api/v1/verify?slug=scalable-diffusion-models-with-transformers-dit","publicKey":"https://unfragile.ai/api/v1/trust-passport-public-key","spec":"https://unfragile.ai/trust","schema":"https://unfragile.ai/schema.json","docs":"https://unfragile.ai/docs"}}