Activation Checkpointing With Selective Layer Recomputation

1

DeepSpeed · Framework · 57/100

Microsoft's distributed training library: ZeRO optimizer, trillion-parameter scale, RLHF.

Unique: Selective layer-wise checkpointing recomputes only the expensive layers (attention, MLP) and keeps normalization activations in memory, for a stated 30-50% memory reduction at under 10% extra compute. It integrates transparently through the gradient checkpointing API.

vs others: More fine-grained than full-model checkpointing; lower memory overhead than storing all activations
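
The idea of recomputing some layers while keeping others' activations can be sketched in plain Python. This is a toy illustration only, not DeepSpeed's API: the `ScaleNorm` and `TanhMLP` classes, the `forward_backward` helper, and all weights are hypothetical, the "activations" are single floats, and only the gradient with respect to the input is tracked. A layer marked for recomputation stores just its input and rebuilds its cache during the backward pass.

```python
import math


class ScaleNorm:
    """Hypothetical cheap layer standing in for normalization: y = g * x."""

    def __init__(self, g):
        self.g = g

    def forward(self, x):
        return self.g * x, (x,)  # cache holds one float

    def backward(self, cache, grad_y):
        return grad_y * self.g  # gradient with respect to the input


class TanhMLP:
    """Hypothetical expensive layer: y = tanh(w2 * tanh(w1 * x))."""

    def __init__(self, w1, w2):
        self.w1, self.w2 = w1, w2

    def forward(self, x):
        h = math.tanh(self.w1 * x)
        y = math.tanh(self.w2 * h)
        return y, (x, h, y)  # cache holds three floats

    def backward(self, cache, grad_y):
        _, h, y = cache
        grad_h = grad_y * (1.0 - y * y) * self.w2
        return grad_h * (1.0 - h * h) * self.w1


def forward_backward(layers, recompute, x):
    """Return (output, input gradient, floats stored, extra forward calls)."""
    saved, stored_floats = [], 0
    for layer, flag in zip(layers, recompute):
        y, cache = layer.forward(x)
        if flag:
            saved.append(("input", x))  # drop the cache, keep only the input
            stored_floats += 1
        else:
            saved.append(("cache", cache))
            stored_floats += len(cache)
        x = y

    grad, extra_forwards = 1.0, 0
    for layer, (kind, value) in zip(reversed(layers), reversed(saved)):
        if kind == "input":
            _, cache = layer.forward(value)  # rebuild the cache on demand
            extra_forwards += 1
        else:
            cache = value
        grad = layer.backward(cache, grad)
    return x, grad, stored_floats, extra_forwards


layers = [ScaleNorm(0.5), TanhMLP(0.8, 1.2), ScaleNorm(2.0), TanhMLP(-0.7, 0.9)]
recompute = [isinstance(layer, TanhMLP) for layer in layers]

baseline = forward_backward(layers, [False] * len(layers), 0.3)
selective = forward_backward(layers, recompute, 0.3)
```

In this toy run the gradient is unchanged, the stored floats drop from 8 to 4, and the cost is two extra forward calls. Real savings depend on tensor shapes and on which layers are selected.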

2

make-a-video-pytorch · Framework · 42/100

via “gradient checkpointing for memory-efficient training”

Implementation of Make-A-Video, a new SOTA text-to-video generator from Meta AI, in PyTorch

Unique: Implements selective gradient checkpointing at multiple network depths rather than global checkpointing, which allows finely tuned memory-compute trade-offs

vs others: More memory-efficient than training without checkpointing, and converges faster than cutting the batch size to an extreme, which makes training on consumer hardware practical

3

Unsloth · Framework · 27/100

via “gradient checkpointing with selective layer activation”

A Python library for fine-tuning LLMs [#opensource](https://github.com/unslothai/unsloth).

Unique: Implements selective layer checkpointing with an automatic cost-benefit analysis. It picks which layers to checkpoint from their memory footprint and computation cost, which avoids manual tuning while keeping memory-speed trade-offs near optimal.

vs others: More granular control than PyTorch's native gradient checkpointing. Automatic layer selection gives a stated memory reduction of 30-50% (versus 20-30% for full checkpointing), and tighter integration with Unsloth kernels gives lower overhead than DeepSpeed's checkpointing.
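
One way to picture an automatic cost-benefit selection is a greedy ranking by memory freed per unit of recompute time. The sketch below is an assumption for illustration, not Unsloth's actual algorithm: the `LayerCost` type, the `select_layers` function, the layer names, and every number are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LayerCost:
    name: str
    activation_mb: float  # memory freed if this layer's activations are dropped
    recompute_ms: float   # extra forward time paid during the backward pass


def select_layers(layers, target_mb):
    """Greedily pick layers with the best MB-freed-per-ms ratio until
    at least target_mb is freed (or the layers run out)."""
    ranked = sorted(
        layers,
        key=lambda l: l.activation_mb / max(l.recompute_ms, 1e-9),
        reverse=True,
    )
    chosen, freed = [], 0.0
    for layer in ranked:
        if freed >= target_mb:
            break
        chosen.append(layer.name)
        freed += layer.activation_mb
    return chosen, freed


# Hypothetical profile of a two-block model.
profile = [
    LayerCost("attn.0", 400.0, 8.0),
    LayerCost("mlp.0", 300.0, 4.0),
    LayerCost("norm.0", 20.0, 0.5),
    LayerCost("attn.1", 400.0, 8.0),
    LayerCost("mlp.1", 300.0, 4.0),
]

chosen, freed = select_layers(profile, target_mb=600.0)
```

With these made-up numbers the two MLP layers have the best ratio, so they are selected first and already meet the 600 MB target. A greedy ratio ranking is simple but not guaranteed optimal; the underlying problem resembles a knapsack.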
