Text-to-image generation. Repository for the NeurIPS 2021 paper "CogView: Mastering Text-to-Image Generation via Transformers".
Unique: Unified autoregressive transformer architecture that treats text and images as discrete token sequences, enabling a single 4B-parameter model to handle generation, captioning, super-resolution, and reranking without task-specific heads. Uses VQ-VAE tokenization (an 8192-code vocabulary) to convert images into token sequences, so generation becomes transformer-based next-token prediction rather than pixel-space diffusion.
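The tokenization idea above can be sketched as follows. This is a minimal conceptual illustration, not CogView's actual code: the codebook and the "encoder" latents are random stand-ins for a trained VQ-VAE, and the vocabulary sizes and 32x32 token grid are assumptions loosely following the paper's 8192-code setup.

```python
import numpy as np

# Hypothetical sizes: 8192 image codes as in the paper; text vocab size assumed.
TEXT_VOCAB = 50000
IMAGE_CODES = 8192
GRID = 32  # 32x32 = 1024 image tokens per image

rng = np.random.default_rng(0)
codebook = rng.normal(size=(IMAGE_CODES, 64))  # random stand-in, 64-dim codes

def quantize(latents):
    """Map each latent vector to the index of its nearest codebook entry."""
    # squared distances via ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    d = ((latents ** 2).sum(1, keepdims=True)
         + (codebook ** 2).sum(1)
         - 2 * latents @ codebook.T)
    return d.argmin(axis=1)  # (N,) integer token ids in [0, IMAGE_CODES)

# Fake encoder output: one 64-dim latent per image patch.
latents = rng.normal(size=(GRID * GRID, 64))
image_tokens = quantize(latents)  # 1024 discrete image tokens

# Text and image tokens share one flat sequence; image ids are offset past
# the text vocabulary so the transformer sees a single unified token space.
text_tokens = np.array([17, 402, 993])  # hypothetical caption token ids
sequence = np.concatenate([text_tokens, image_tokens + TEXT_VOCAB])

print(sequence.shape)  # (1027,) = 3 text tokens + 1024 image tokens
```

Once images are flattened into such sequences, the same left-to-right transformer objective (predict the next token) covers both modalities, which is what lets one model serve generation, captioning, and reranking.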
vs others: Simpler unified architecture than task-specific models, but slower inference than diffusion-based alternatives, and v1 accepts only Chinese text input; stronger than concurrent approaches (VQGAN-CLIP, DALL-E) at handling long-range dependencies via transformer attention.