VQ-VAE Discrete Tokenization for Image Compression and Generation

1

stable-diffusion-xl-base-1.0 | Model | 57/100

via “vae latent encoding and decoding with quality-speed tradeoff”

Text-to-image model. 2,041,667 downloads.

Unique: Implements 8× spatial compression VAE enabling efficient diffusion in latent space; includes tiling mode for processing images larger than training resolution without retraining or cascading upsampling

vs others: More efficient than pixel-space diffusion (8× downsampling per side leaves 64× fewer spatial positions for the denoiser to process); tiling approach avoids cascading upsampling artifacts; comparable to other latent diffusion models but with explicit tiling support for large images
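
The tiling idea in this entry can be sketched in a few lines. This is a hypothetical illustration, not the model's or any library's implementation: `tile_starts` and its `tile` and `overlap` parameters are names assumed here. It only computes where overlapping tiles begin along one image axis; a real pipeline would also blend the overlapping regions after decoding.

```python
def tile_starts(length, tile, overlap):
    """Start offsets of overlapping tiles that cover [0, length) on one axis."""
    if not 0 <= overlap < tile:
        raise ValueError("overlap must be non-negative and smaller than the tile size")
    if length <= tile:
        return [0]  # a single tile already covers the whole axis
    stride = tile - overlap
    starts = list(range(0, length - tile, stride))
    starts.append(length - tile)  # align the last tile with the far edge
    return starts


# Assumed sizes for illustration: a 2048-pixel side, 1024-pixel tiles, 256 pixels of overlap.
starts = tile_starts(2048, 1024, 256)
```

Aligning the last tile with the edge keeps every tile the same size, at the cost of a slightly larger overlap for the final pair.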

2

stable-diffusion-v1-5 | Model | 54/100

via “vae-based latent space compression and reconstruction”

Text-to-image model. 1,481,468 downloads.

Unique: Uses a pre-trained VAE that downsamples each spatial side by 8x (a 512x512x3 image becomes a 64x64x4 latent), so the denoiser processes 64x fewer spatial positions than pixel-space diffusion; the VAE is frozen (not fine-tuned during generation), ensuring stable and predictable compression

vs others: More efficient than pixel-space diffusion (DDPM); the compression ratio is fixed and well understood, unlike schemes that adapt the compression per image

3

DALLE2-pytorch | Framework | 51/100

via “latent diffusion with vqganvae compression for memory-efficient training”

Implementation of DALL-E 2, OpenAI's updated text-to-image synthesis neural network, in Pytorch

Unique: Provides explicit VQGanVAE integration as a preprocessing and decoding layer, allowing users to toggle between pixel-space and latent-space training without architectural changes. Includes utilities for batch encoding datasets to latent codes, enabling reproducible training workflows.

vs others: Exposes more explicit control over the compression stage than Stable Diffusion (which also uses a VAE, but as a fixed component), and is more flexible than pixel-space DALL-E 2 because users can swap VQGanVAE variants or use alternative compression schemes without rewriting core logic.

4

stable-diffusion-v1-4 | Model | 51/100

via “variational autoencoder (vae) latent encoding and decoding”

Text-to-image model. 621,488 downloads.

Unique: Uses a learned VAE with KL-divergence regularization to balance reconstruction quality and latent space smoothness; encoded latents are rescaled by a fixed factor of 0.18215 before diffusion. Operates at 8x spatial compression per side (512→64) while maintaining perceptual quality through a decoder trained jointly with the encoder.

vs others: More efficient than pixel-space diffusion (DALL-E 2's decoder, Imagen) while maintaining comparable quality; enables deployment on consumer-grade hardware where pixel-space models need far more memory and compute.
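
The 512→64 figure quoted for this entry can be checked with simple arithmetic. The sketch below assumes a 3-channel RGB input and a 4-channel latent; the function names are illustrative only. It separates the reduction in spatial positions from the reduction in total stored values, two ratios that are easy to confuse.

```python
def latent_shape(height, width, factor=8, latent_channels=4):
    """Shape (channels, height, width) of the latent for a given image size."""
    if height % factor or width % factor:
        raise ValueError("image sides must be divisible by the downsampling factor")
    return (latent_channels, height // factor, width // factor)


def reduction_ratios(height, width, factor=8, image_channels=3, latent_channels=4):
    """Return (spatial position ratio, total value ratio) of image versus latent."""
    c, h, w = latent_shape(height, width, factor, latent_channels)
    spatial = (height * width) / (h * w)
    values = (image_channels * height * width) / (c * h * w)
    return spatial, values


shape = latent_shape(512, 512)              # (4, 64, 64)
spatial, values = reduction_ratios(512, 512)  # 64.0 and 48.0
```

An 8x reduction per side therefore gives 64x fewer spatial positions, but only 48x fewer values, because the latent carries one more channel than the RGB image.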

5

FLUX.1-dev | Model | 51/100

via “vae latent space encoding and decoding”

Text-to-image model. 733,924 downloads.

Unique: Uses learned VAE compression rather than fixed downsampling, enabling perceptually-aware compression that preserves semantic content while reducing spatial dimensions; enables efficient latent space manipulation for inpainting and editing

vs others: More efficient than pixel-space diffusion (64x fewer spatial positions at 8x downsampling per side); more quality-preserving than naive downsampling because the VAE learns its compression from data; supports latent-space editing workflows that are costly to run in pixel space

6

DALLE-pytorch | Framework | 50/100

via “pluggable vae abstraction with multiple encoder implementations”

Implementation / replication of DALL-E, OpenAI's Text to Image Transformer, in Pytorch

Unique: Abstracts VAE as a swappable component with three concrete implementations (custom trainable, pre-trained OpenAI, VQGan), allowing researchers to isolate VAE quality from transformer training. Supports different codebook sizes (1024, 8192) enabling explicit compression-quality trade-off exploration.

vs others: More flexible than monolithic implementations; allows using OpenAI's pre-trained VAE without training, or training custom VAEs for domain adaptation. Closed-source APIs that do not expose the encoder/decoder offer neither option.
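
The compression-quality trade-off tied to codebook size can be made concrete by counting bits. The sketch below is a back-of-the-envelope estimate, not code from the repository: it assumes 8-bit RGB input, a square token grid, and fixed-length codes, and the 256x256 image with a 32x32 grid is an assumed example configuration.

```python
def bits_per_token(codebook_size):
    """Bits needed to index one entry of a codebook with `codebook_size` codes."""
    if codebook_size < 2:
        raise ValueError("a codebook needs at least two entries")
    return (codebook_size - 1).bit_length()


def compression_ratio(height, width, grid, codebook_size):
    """Raw 8-bit RGB size divided by the size of the token grid, both in bits."""
    raw_bits = height * width * 3 * 8
    token_bits = grid * grid * bits_per_token(codebook_size)
    return raw_bits / token_bits


small = compression_ratio(256, 256, 32, 1024)  # 10 bits per token
large = compression_ratio(256, 256, 32, 8192)  # 13 bits per token
```

Moving from 1024 to 8192 codes adds only 3 bits per token, so the larger codebook buys extra representational capacity for a modest drop in compression ratio.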

7

stable-diffusion-xl-1.0-inpainting-0.1 | Model | 48/100

via “vae-based image encoding and decoding with latent compression”

Text-to-image model. 297,544 downloads.

Unique: SDXL uses a specialized VAE architecture with improved reconstruction fidelity compared to earlier SD versions, incorporating residual blocks and attention mechanisms in the decoder to minimize artifacts. The encoder produces a distribution rather than point estimates, enabling stochastic sampling for diversity in inpainting.

vs others: SDXL's VAE produces sharper reconstructions than SD 1.5's VAE while keeping the same 8x per-side spatial downsampling as earlier Stable Diffusion VAEs.

8

stable-diffusion-inpainting | Model | 47/100

via “vae-based latent encoding and decoding”

Text-to-image model. 218,560 downloads.

Unique: Uses a KL-divergence regularized VAE trained on 512x512 images with a fixed 8x spatial compression ratio per side, balancing reconstruction fidelity against latent space smoothness. The encoder produces both a mean and a log-variance, so latents can be sampled stochastically; a fixed scaling factor then normalizes their magnitude for the diffusion model.

vs others: More efficient than pixel-space diffusion because the latent has 64x fewer spatial positions; higher quality than aggressive JPEG compression because the VAE is trained end-to-end on natural images; less adaptable than schemes with a tunable compression ratio because the downsampling and scaling factors are fixed.
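
Sampling from an encoder that outputs a mean and a log-variance is usually done with the reparameterization trick. The following is a minimal sketch on plain Python lists, with made-up values standing in for real encoder outputs; `sample_latent` is a name chosen for illustration.

```python
import math
import random


def sample_latent(mean, logvar, rng):
    """Draw z = mean + std * eps with eps ~ N(0, 1), where std = exp(0.5 * logvar)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mean, logvar)]


rng = random.Random(0)        # seeded so the draw is repeatable
mean = [0.5, -1.0, 2.0]       # made-up encoder outputs
logvar = [-2.0, -2.0, -2.0]
z = sample_latent(mean, logvar, rng)
```

As the log-variance becomes very negative the standard deviation shrinks toward zero and the sample collapses onto the mean, which is the deterministic encoding.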

9

sd-turbo | Model | 46/100

via “vae latent encoding and decoding for image compression”

Text-to-image model. 608,507 downloads.

Unique: Uses a pre-trained VAE that downsamples each side by 8x, so the diffusion process operates on 64x64 latents instead of 512x512 pixels and handles 64x fewer spatial positions per step; the same VAE design is shared across Stable Diffusion v1.x and v2.x checkpoints, ensuring consistency

vs others: More efficient than pixel-space diffusion (DDPM), which requires full-resolution processing, but introduces compression artifacts; more standardized than the custom latent spaces of proprietary models like DALL-E, which use non-standard compression schemes

10

stable-diffusion-v1-5 | Model | 46/100

via “vae-based latent encoding and decoding”

Text-to-image model. 785,165 downloads.

Unique: Stable Diffusion v1.5 uses a frozen, pre-trained VAE with a fixed scaling factor (0.18215) to normalize latent variance. This design choice prioritizes stability and reproducibility over reconstruction fidelity, enabling reliable diffusion training without VAE collapse.

vs others: More efficient than pixel-space diffusion because each denoising step processes a 64x64 latent, which has 64x fewer spatial positions than a 512x512 image (the number of steps is unchanged); more stable than learned latent scaling because the scaling factor is fixed and tuned for diffusion training
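
The fixed scaling factor quoted for this entry is applied as a plain multiplication after encoding and undone by a division before decoding. The sketch below shows that round trip on a list of made-up latent values; the function names are illustrative, not a library API.

```python
SCALING_FACTOR = 0.18215  # value quoted for Stable Diffusion v1.x


def to_diffusion_space(latent):
    """Rescale raw VAE latents before they are passed to the diffusion model."""
    return [x * SCALING_FACTOR for x in latent]


def to_vae_space(latent):
    """Undo the rescaling before the latents are passed to the VAE decoder."""
    return [x / SCALING_FACTOR for x in latent]


raw = [5.2, -3.1, 0.0, 7.9]  # made-up latent values
scaled = to_diffusion_space(raw)
restored = to_vae_space(scaled)
```

Forgetting the division before decoding is a common source of washed-out or saturated outputs, since the decoder then sees latents at the wrong magnitude.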

11

Dreambooth-Stable-Diffusion | Repository | 46/100

via “image preprocessing and augmentation with resolution normalization”

Implementation of Dreambooth (https://arxiv.org/abs/2208.12242) with Stable Diffusion

Unique: Combines image preprocessing with VAE latent encoding in a single pipeline, reducing memory overhead by training on latents downsampled 8x per side rather than on full-resolution images.

vs others: More efficient than pixel-space training (64x fewer spatial positions per image) and more flexible than fixed-resolution inputs, but introduces VAE encoding artifacts and requires careful augmentation tuning to avoid losing subject details.

12

Infinity | Repository | 45/100

via “visual tokenization with variable-resolution vae supporting 2^16 to 2^64 vocabulary sizes”

[CVPR 2025 Oral] Infinity ∞: Scaling Bitwise AutoRegressive Modeling for High-Resolution Image Synthesis

Unique: Supports variable vocabulary sizes (2^16 to 2^64) through configurable quantization, enabling dynamic quality-latency trade-offs. Unlike fixed-vocabulary tokenizers (e.g., VQ-VAE with 8192 tokens), Infinity's VAE can scale vocabulary exponentially without retraining, adapting to different deployment constraints.

vs others: Covers a far wider range of vocabulary sizes than fixed-vocabulary tokenizers, enabling fine-grained control over reconstruction quality and sequence length without model retraining.
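
Vocabulary sizes such as 2^16 or 2^64 are practical because a bitwise tokenizer never stores a codebook of that size: each token is a short vector of bits, and the vocabulary is the set of all bit patterns. The sketch below is a simplified sign-based quantizer written for illustration; it is not Infinity's implementation, and `bitwise_quantize` is a hypothetical name.

```python
def bitwise_quantize(vector):
    """Quantize each channel to one bit by its sign and pack the bits into an index."""
    bits = [1 if x > 0 else 0 for x in vector]
    index = 0
    for bit in bits:
        index = (index << 1) | bit  # first channel becomes the most significant bit
    return bits, index


def vocabulary_size(num_bits):
    """Number of distinct tokens representable with `num_bits` binary channels."""
    return 2 ** num_bits


bits, index = bitwise_quantize([0.3, -1.2, 0.8, -0.1])  # bits 1, 0, 1, 0
```

Adding one channel doubles the vocabulary, so 16 channels give 65,536 tokens and 64 channels give 2^64, without any lookup table growing alongside.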

13

Qwen-Image-Lightning | Model | 45/100

via “efficient latent-space image generation with vae decoding”

Text-to-image model. 326,804 downloads.

Unique: Leverages Qwen-Image's pre-trained VAE decoder to convert diffusion-generated latents to images, with latent space dimensionality and scaling factors optimized for the distilled model's architecture rather than generic VAE implementations

vs others: Achieves faster inference than pixel-space diffusion models like DALL-E while maintaining quality comparable to full-resolution approaches, and more efficient than naive latent-space approaches by using a VAE specifically tuned to the model's training distribution

14

CogView | Repository | 44/100

via “tokenization-aware data pipeline with vq-vae image encoding”

Text-to-Image generation. The repo for NeurIPS 2021 paper "CogView: Mastering Text-to-Image Generation via Transformers".

Unique: Integrates VQ-VAE image tokenization directly into the data pipeline, enabling end-to-end discrete tokenization of both images and text. Dataset classes (in data_utils.py) handle lazy loading and caching of tokenized data, reducing per-epoch preprocessing overhead compared to on-the-fly encoding.

vs others: More efficient than on-the-fly VQ-VAE encoding during training, but requires upfront preprocessing and disk space; simpler than pixel-space data augmentation due to fixed token vocabulary.
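
The benefit of caching tokenized data, as opposed to encoding on the fly, can be illustrated with a small in-memory cache. Everything here is hypothetical: `TokenCache` and `toy_encode` are stand-ins written for this sketch and do not reflect the repository's dataset classes or its VQ-VAE.

```python
class TokenCache:
    """Encode each item once and reuse the stored tokens on later epochs."""

    def __init__(self, encode):
        self._encode = encode
        self._store = {}
        self.encoder_calls = 0

    def get(self, key, image):
        if key not in self._store:
            self._store[key] = self._encode(image)
            self.encoder_calls += 1
        return self._store[key]


def toy_encode(image):
    """Stand-in tokenizer: bucket 8-bit pixel values into four token ids."""
    return [pixel // 64 for pixel in image]


cache = TokenCache(toy_encode)
dataset = {"a": [0, 70, 130, 255], "b": [10, 64, 200, 128]}
for _epoch in range(3):
    for key, image in dataset.items():
        tokens = cache.get(key, image)
```

After three epochs over two images the encoder has run only twice; the price is the storage needed to hold the tokens, which is the trade-off the entry describes.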

15

min-dalle | Repository | 43/100

via “vqgan detokenization for pixel-space image reconstruction”

min(DALL·E) is a fast, minimal port of DALL·E Mini to PyTorch

Unique: Uses pre-trained VQGan decoder (not a custom decoder), ensuring compatibility with tokens generated by the DALL·E Bart decoder which was trained on VQGan-tokenized images. Supports progressive detokenization via iterator pattern, enabling real-time image rendering without waiting for full token sequence.

vs others: More efficient than diffusion-based decoding (1-2s vs 30-60s) because it's a single forward pass; maintains higher fidelity than upsampling-based approaches because it uses learned reconstruction rather than interpolation.

16

diffusers | Repository | 28/100

via “vae latent space compression and reconstruction with learned bottleneck”

State-of-the-art diffusion in PyTorch and JAX.

Unique: Uses a learned VAE encoder/decoder that downsamples each spatial side by 4-8x, enabling diffusion in latent space rather than pixel space. The latent then has 16-64x fewer spatial positions, with memory and compute falling accordingly, while quality is maintained through the VAE's learned reconstruction, unlike naive downsampling approaches.

vs others: More efficient than pixel-space diffusion and generally free of the quantization artifacts of vector-quantized approaches; introduces some reconstruction loss compared to pixel-space generation and adds encoder/decoder latency.

17

Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning (CM3Leon) | Product | 24/100

via “discrete image tokenization for unified sequence representation”

* ⏫ 07/2023: [Meta-Transformer: A Unified Framework for Multimodal Learning (Meta-Transformer)](https://arxiv.org/abs/2307.10802)

Unique: Uses discrete image tokenization to enable unified autoregressive processing of images and text in a single decoder, treating image generation as sequence prediction rather than pixel-space generation

vs others: Simpler than continuous image representations because it reuses text token infrastructure; enables unified architecture but trades off visual fidelity compared to continuous or diffusion-based approaches
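
One common way to place image and text tokens in a single sequence is to shift the image token ids past the end of the text vocabulary. The sketch below shows only that id-space trick under assumed vocabulary sizes; it is not taken from CM3Leon, and real systems also add special boundary tokens that are omitted here.

```python
def unified_sequence(text_ids, image_ids, text_vocab_size):
    """Concatenate text and image tokens into one shared id space."""
    shifted = [text_vocab_size + i for i in image_ids]
    return list(text_ids) + shifted


def split_sequence(sequence, text_vocab_size):
    """Recover the text ids and the original image ids from a unified sequence."""
    text = [t for t in sequence if t < text_vocab_size]
    image = [t - text_vocab_size for t in sequence if t >= text_vocab_size]
    return text, image


# Assumed sizes for illustration: 100 text tokens, image ids starting at 100.
sequence = unified_sequence([5, 17], [0, 3], text_vocab_size=100)
text, image = split_sequence(sequence, text_vocab_size=100)
```

Because both modalities are now plain integers in one range, a single decoder with one embedding table and one softmax can predict either kind of token.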

18

dalle-mini | Model | 22/100

via “vqgan-based image decoding from latent tokens”

dalle-mini — AI demo on HuggingFace

Unique: Generates images as discrete VQGAN tokens with a sequence-to-sequence transformer instead of running diffusion, then decodes the tokens to pixels in a single VQGAN forward pass, enabling inference on consumer hardware; the VQGAN codebook is pre-trained on ImageNet, providing strong inductive bias for natural image structure

vs others: Faster to decode than iterative diffusion sampling (e.g. Stable Diffusion) on the same hardware; the trade-off is lower image quality due to quantization artifacts and limited resolution compared to modern diffusion models

19

Muse: Text-To-Image Generation via Masked Generative Transformers (Muse) | Product | 21/100

via “vq-vae discrete tokenization for image compression and generation”

* ⭐ 02/2023: [Structure and Content-Guided Video Synthesis with Diffusion Models (Gen-1)](https://arxiv.org/abs/2302.03011)

Unique: Leverages learned discrete codebook from VQ-VAE rather than fixed quantization schemes, allowing the model to learn task-specific token representations that optimize for image generation quality rather than reconstruction fidelity

vs others: More efficient than pixel-space models because token sequences are 256x shorter than pixel sequences, which cuts the O(n²) self-attention cost by a factor of 256² while maintaining competitive image quality
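
The core step of vector quantization, the technique this page is organized around, is a nearest-neighbor lookup in a learned codebook. The sketch below uses a tiny hand-written codebook and plain Python lists; in a trained tokenizer the codebook entries are learned and the vectors come from an encoder network.

```python
def quantize(vector, codebook):
    """Return (index, code) of the codebook entry closest to `vector` in squared L2 distance."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    index = min(range(len(codebook)), key=lambda i: squared_distance(vector, codebook[i]))
    return index, codebook[index]


codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]]   # made-up 2-D codes
features = [[0.9, 1.2], [0.1, -0.2], [3.5, 0.4]]  # made-up encoder outputs
tokens = [quantize(f, codebook)[0] for f in features]
```

The resulting list of indices is the discrete token sequence a transformer would model; decoding maps each index back to its code vector before the decoder network reconstructs pixels.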

20

Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks (BEiT) | Product | 21/100

via “discrete visual tokenization with learned codebook”

* ⭐ 09/2022: [PaLI: A Jointly-Scaled Multilingual Language-Image Model (PaLI)](https://arxiv.org/abs/2209.06794)

Unique: Uses learned discrete codebooks to tokenize images, creating a bridge between continuous vision features and discrete language tokens. This enables applying BERT-style masked language modeling directly to images without pixel-level reconstruction.

vs others: Provides better semantic alignment with language models than continuous feature representations because discrete tokens create a shared vocabulary between modalities, improving joint vision-language learning compared to approaches using separate continuous representations.
