Masked Image Modeling With Discrete Visual Tokens

1

Octo | Repository | 58/100

via “multimodal observation tokenization with flexible sensor composition”

Generalist robot policy model from Open X-Embodiment.

Unique: Implements a modular tokenizer architecture where image tokenizers (learned codebooks or pretrained vision models) and proprioception tokenizers (linear/MLP projections) are independently trained and composed, allowing flexible sensor configuration without retraining the transformer backbone. Supports variable numbers of cameras through dynamic token concatenation.

vs others: More flexible than end-to-end vision models that require fixed camera configurations, and more efficient than raw pixel processing by reducing observation dimensionality 100-1000x while preserving task-relevant information through learned tokenization.
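The dynamic token concatenation described above can be sketched roughly as follows. This is a minimal illustration, not Octo's code: `tokenize_image`, `tokenize_proprio`, and `build_observation_tokens` are hypothetical stand-ins, and the "tokenizers" here only reshape lists where a real system would use learned encoders.

```python
from typing import Dict, List

Token = List[float]  # one embedding vector


def tokenize_image(pixels: List[float], patch: int = 4) -> List[Token]:
    """Hypothetical stand-in: split a flat pixel list into fixed-size patch tokens."""
    return [pixels[i:i + patch] for i in range(0, len(pixels), patch)]


def tokenize_proprio(state: List[float], width: int = 4) -> List[Token]:
    """Hypothetical stand-in: pad or truncate the joint state into a single token."""
    return [(state + [0.0] * width)[:width]]


def build_observation_tokens(cameras: Dict[str, List[float]],
                             proprio: List[float]) -> List[Token]:
    """Concatenate tokens from however many cameras are present, then proprioception."""
    tokens: List[Token] = []
    for name in sorted(cameras):  # fixed ordering keeps the sequence layout stable
        tokens.extend(tokenize_image(cameras[name]))
    tokens.extend(tokenize_proprio(proprio))
    return tokens


one_cam = build_observation_tokens({"wrist": [0.0] * 16}, [0.1, 0.2])
two_cam = build_observation_tokens({"wrist": [0.0] * 16, "front": [1.0] * 16}, [0.1, 0.2])
```

Adding a camera only lengthens the token sequence; the downstream transformer sees the same token width either way.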

2

Stable-Diffusion | Repository | 48/100

via “dreambooth subject-specific model personalization”

FLUX, Stable Diffusion, SDXL, SD3, LoRA, Fine Tuning, DreamBooth, Training, Automatic1111, Forge WebUI, SwarmUI, DeepFake, TTS, Animation, Text To Video, Tutorials, Guides, Lectures, Courses, ComfyUI, Google Colab, RunPod, Kaggle, NoteBooks, ControlNet, Voice Cloning, AI, AI News, ML, ML News

Unique: Implements class-prior preservation loss (generating synthetic regularization images from base model during training) to prevent catastrophic forgetting; OneTrainer/Kohya automate the full pipeline including synthetic image generation, token selection validation, and learning rate scheduling based on dataset size

vs others: More stable than vanilla fine-tuning due to class-prior regularization; requires 10-100x fewer images than full fine-tuning; faster convergence (30-60 minutes) than Textual Inversion which requires 1000+ steps
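The class-prior preservation idea can be written as a two-term objective: a reconstruction term on the subject images plus a weighted term on the synthetic class images. The sketch below assumes plain mean squared error on flat lists; `dreambooth_loss` and `prior_weight` are illustrative names, not the API of OneTrainer or Kohya.

```python
def mse(pred, target):
    """Mean squared error between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def dreambooth_loss(inst_pred, inst_target, prior_pred, prior_target, prior_weight=1.0):
    """Subject (instance) loss plus weighted class-prior preservation loss."""
    instance_term = mse(inst_pred, inst_target)   # fit the new subject
    prior_term = mse(prior_pred, prior_target)    # stay close to the base model's class
    return instance_term + prior_weight * prior_term


with_prior = dreambooth_loss([1.0, 2.0], [1.0, 2.0], [0.0, 0.0], [1.0, 1.0])
without_prior = dreambooth_loss([1.0, 2.0], [1.0, 2.0], [0.0, 0.0], [1.0, 1.0],
                                prior_weight=0.0)
```

With `prior_weight=0.0` the objective reduces to vanilla fine-tuning, which is the setting the regularization term is meant to stabilize.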

3

bert-large-uncased | Model | 48/100

via “masked language model token prediction via bidirectional transformer attention”

fill-mask model. 1,120,072 downloads.

Unique: Implements bidirectional context modeling through masked language modeling pretraining (unlike GPT's unidirectional approach), using WordPiece subword tokenization with a 30,522-token vocabulary and a 24-layer transformer with 16 attention heads, trained on BookCorpus + Wikipedia for 1M steps with static masking (masks are generated once during data preprocessing; dynamic masking was introduced later by RoBERTa)

vs others: Later models such as RoBERTa and ELECTRA generally report higher GLUE scores, RoBERTa in part because of a larger pretraining corpus; slower at inference than DistilBERT (about 40% fewer parameters than BERT-base) and less multilingual coverage than mBERT
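Masked language modeling corrupts a fraction of input tokens and asks the model to recover them. The sketch below follows the recipe from the BERT paper (select about 15% of positions; of those, 80% become `[MASK]`, 10% a random token, 10% are left unchanged). It is a simplification: it works on word strings and flips an independent coin per position, and `mask_tokens` is a hypothetical helper, not a Hugging Face function.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (corrupted inputs, labels); labels are None where no prediction is required."""
    rng = random.Random(seed)
    inputs = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue                       # position not selected for prediction
        labels[i] = tok                    # the model must recover the original token
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = MASK               # 80%: replace with the mask symbol
        elif roll < 0.9:
            inputs[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token in the input
    return inputs, labels


tokens = ["the", "cat", "sat", "on", "the", "mat"] * 50
inputs, labels = mask_tokens(tokens, vocab=["dog", "ran"])
```

Because the prediction at a masked position can use tokens on both sides, the objective trains bidirectional context rather than left-to-right prediction.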

4

mask2former-swin-tiny-coco-instance | Model | 41/100

via “iterative instance mask refinement via masked attention”

image-segmentation model. 63,563 downloads.

Unique: Applies masked cross-attention in which each query attends only within the foreground region of the mask predicted by the previous decoder layer, creating a feedback loop that concentrates attention on the predicted object region. This differs from standard transformer decoders, whose cross-attention spans all image features; the mask predictions that drive the attention mask are trained end-to-end.

vs others: Achieves higher instance segmentation accuracy (+2-3 mAP) than single-pass methods like DETR by iteratively refining boundaries; trades off against faster inference-only methods which sacrifice accuracy for speed.
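As a rough illustration of masked attention for a single query: positions the previous mask prediction marks as background are excluded before the softmax. The function name, the 0.5 threshold default, and the fall-back to full attention when the mask is empty are assumptions of this sketch, written with the standard library only.

```python
import math


def masked_attention_weights(scores, prev_mask_probs, threshold=0.5):
    """Softmax over attention scores, restricted to the previous layer's predicted foreground."""
    keep = [p >= threshold for p in prev_mask_probs]
    if not any(keep):
        keep = [True] * len(scores)  # empty mask: fall back to attending everywhere
    masked = [s if k else -math.inf for s, k in zip(scores, keep)]
    peak = max(masked)               # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]


weights = masked_attention_weights([1.0, 1.0, 5.0], [0.9, 0.8, 0.1])
fallback = masked_attention_weights([0.0, 0.0], [0.1, 0.2])
```

In the first call the third position has the highest raw score but receives zero weight, because the previous mask marked it as background.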

5

Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning (CM3Leon) | Product | 26/100

via “discrete image tokenization for unified sequence representation”

* ⏫ 07/2023: [Meta-Transformer: A Unified Framework for Multimodal Learning (Meta-Transformer)](https://arxiv.org/abs/2307.10802)

Unique: Uses discrete image tokenization to enable unified autoregressive processing of images and text in a single decoder, treating image generation as sequence prediction rather than pixel-space generation

vs others: Simpler than continuous image representations because it reuses text token infrastructure; enables unified architecture but trades off visual fidelity compared to continuous or diffusion-based approaches
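One common way to obtain a unified sequence is to shift image code indices past the text vocabulary so both share a single id space. The sketch below assumes illustrative vocabulary sizes and hypothetical begin/end-of-image ids; it is not taken from the CM3Leon implementation.

```python
TEXT_VOCAB_SIZE = 1000        # assumed size of the text vocabulary
IMAGE_CODEBOOK_SIZE = 8192    # assumed number of discrete image codes
BOI = TEXT_VOCAB_SIZE + IMAGE_CODEBOOK_SIZE  # hypothetical begin-of-image id
EOI = BOI + 1                                # hypothetical end-of-image id


def to_unified(text_ids, image_codes):
    """Build one token sequence: text ids, then offset image codes between markers."""
    shifted = [TEXT_VOCAB_SIZE + c for c in image_codes]
    return list(text_ids) + [BOI] + shifted + [EOI]


def is_image_token(token_id):
    """True if the id falls in the range reserved for image codes."""
    return TEXT_VOCAB_SIZE <= token_id < BOI


sequence = to_unified([5, 7], [0, 3])
```

A single decoder can then be trained with ordinary next-token prediction over `sequence`, with no modality-specific output head.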

6

Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks (BEiT) | Product | 23/100

via “discrete visual tokenization with learned codebook”

* ⭐ 09/2022: [PaLI: A Jointly-Scaled Multilingual Language-Image Model (PaLI)](https://arxiv.org/abs/2209.06794)

Unique: Uses learned discrete codebooks to tokenize images, creating a bridge between continuous vision features and discrete language tokens. This enables applying BERT-style masked language modeling directly to images without pixel-level reconstruction.

vs others: Provides better semantic alignment with language models than continuous feature representations because discrete tokens create a shared vocabulary between modalities, improving joint vision-language learning compared to approaches using separate continuous representations.
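The masked-image-modeling target can be sketched as: map each patch to a codebook index, hide some patches, and ask the model to predict the hidden indices. The nearest-neighbour lookup below is a generic vector-quantization stand-in for whatever learned tokenizer is actually used; `nearest_code` and `mim_targets` are hypothetical names.

```python
def nearest_code(vec, codebook):
    """Index of the codebook entry closest to vec in squared Euclidean distance."""
    return min(
        range(len(codebook)),
        key=lambda k: sum((v - c) ** 2 for v, c in zip(vec, codebook[k])),
    )


def mim_targets(patches, codebook, masked_positions):
    """Discrete token ids the model must predict at the masked patch positions."""
    ids = [nearest_code(p, codebook) for p in patches]
    return {i: ids[i] for i in masked_positions}


codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
patches = [[0.1, 0.0], [0.9, 1.1], [0.1, 0.8]]
targets = mim_targets(patches, codebook, masked_positions=[1, 2])
```

The prediction problem becomes classification over codebook indices, analogous to predicting a masked word id, rather than regression on pixel values.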

7

Muse: Text-To-Image Generation via Masked Generative Transformers (Muse) | Product | 23/100

via “vq-vae discrete tokenization for image compression and generation”

* ⭐ 02/2023: [Structure and Content-Guided Video Synthesis with Diffusion Models (Gen-1)](https://arxiv.org/abs/2302.03011)

Unique: Leverages learned discrete codebook from VQ-VAE rather than fixed quantization schemes, allowing the model to learn task-specific token representations that optimize for image generation quality rather than reconstruction fidelity

vs others: More efficient than pixel-space diffusion models because the token sequence is 256x shorter than the pixel sequence, which cuts quadratic self-attention cost from O(n²) to O((n/256)²) = O(n²/256²), while maintaining competitive image quality
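The 256x figure is consistent with, for example, a 256x256 image encoded on a 16x16 token grid. The arithmetic below assumes that resolution and a downsampling factor of 16 purely for illustration; `attention_cost_ratio` is a hypothetical helper.

```python
def attention_cost_ratio(image_side, downsample):
    """Return (sequence-length reduction, quadratic attention-cost reduction)."""
    pixels = image_side ** 2                  # positions in pixel space
    tokens = (image_side // downsample) ** 2  # positions after tokenization
    length_ratio = pixels // tokens
    cost_ratio = (pixels ** 2) // (tokens ** 2)  # self-attention scales with length squared
    return length_ratio, cost_ratio


length_ratio, cost_ratio = attention_cost_ratio(image_side=256, downsample=16)
```

Under these assumptions the sequence is 256 times shorter and pairwise attention work drops by a factor of 256², or 65,536.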
