Published: February 19, 2023

These are my notes on the ZeRO: Memory Optimizations paper.

Data Parallelism (DP) and Model Parallelism (MP) are two ways to parallelize training.

| | Data Parallel (DP) | Model Parallel (MP) |
| --- | --- | --- |
| Split | Data is split across multiple GPUs; each GPU holds a full copy of the model. | The model is split across multiple GPUs; each GPU sees a copy of the data. |
| Gradient update | Gradients are averaged across GPUs and each replica applies the same update. | Each GPU computes and applies gradients only for its own partition of the model. |
| Memory efficiency | Low. Does not reduce the memory occupied by the model; runs out of memory for models with more than 1.4B parameters. | High. |
| Compute/communication efficiency | High. | Low. Works well only within a single node, where inter-GPU communication bandwidth is high. |
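Below is a minimal sketch of the data-parallel column of this table, using PyTorch's `DistributedDataParallel` (not from the paper; the model, data shapes, and `torchrun` launch are illustrative assumptions). Each process holds a full replica of the model, which is exactly the memory cost DP does not reduce, while gradients are averaged across GPUs during `backward()`.

```python
# Minimal data-parallel sketch (illustrative; launch with
# `torchrun --nproc_per_node=<num_gpus> ddp_sketch.py`).
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")             # one process per GPU
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)

    # Every rank holds a full replica of the model -- the DP memory cost.
    model = torch.nn.Linear(1024, 1024).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    # Each rank trains on its own shard of the data (random here).
    inputs = torch.randn(32, 1024, device=local_rank)
    targets = torch.randn(32, 1024, device=local_rank)

    loss = torch.nn.functional.mse_loss(model(inputs), targets)
    loss.backward()    # DDP all-reduces (averages) gradients across ranks here
    optimizer.step()   # every replica applies the same averaged update

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```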

Image from the ZeRO paper.