These are notes on the ZeRO memory-optimization paper.
Data Parallelism (DP) and Model Parallelism (MP) are two ways to parallelize training.
| | Data Parallel (DP) | Model Parallel (MP) |
|---|---|---|
| Split | Data is split across GPUs; each GPU holds a full copy of the model. | The model is split across GPUs; each GPU processes the same data. |
| Gradient update | Gradients are averaged across GPUs, and every replica applies the same update. | Each GPU computes and applies gradients only for its own partition of the model. |
| Memory efficiency | Low. Does not reduce the memory occupied by the model; runs out of memory for models beyond ~1.4B parameters on 32 GB GPUs. | High. Each GPU holds only a fraction of the model. |
| Compute/communication efficiency | High. | Low. Requires heavy inter-GPU communication, so it scales well only within a single node where GPU-to-GPU bandwidth is high. |
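
A minimal sketch of the data-parallel gradient update (my own illustration, not code from the paper): each rank computes gradients on its shard of the batch, then all ranks average gradients with an all-reduce so every replica applies the identical update. Assumes `torch.distributed` has been initialized with one process per GPU.

```python
import torch
import torch.distributed as dist

def average_gradients(model: torch.nn.Module) -> None:
    """Average each parameter's gradient across all data-parallel ranks."""
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            # Sum the gradient tensors from every GPU, then divide to get the mean.
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size

# Typical training step on each rank (hypothetical names):
#   loss = compute_loss(model, batch_shard)  # each GPU sees a different shard of the batch
#   loss.backward()                          # local gradients
#   average_gradients(model)                 # all-reduce: every replica now has identical gradients
#   optimizer.step()                         # every replica applies the same update
```

For contrast, a toy sketch of (inter-layer) model parallelism under the same caveats: different layers live on different GPUs, and activations are moved between devices during the forward pass, which is why the approach depends on fast inter-GPU links.

```python
import torch
import torch.nn as nn

class TwoStageModel(nn.Module):
    """Hypothetical two-GPU split: first half of the model on cuda:0, second half on cuda:1."""
    def __init__(self) -> None:
        super().__init__()
        self.stage0 = nn.Linear(1024, 4096).to("cuda:0")
        self.stage1 = nn.Linear(4096, 1024).to("cuda:1")

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = torch.relu(self.stage0(x.to("cuda:0")))
        x = self.stage1(x.to("cuda:1"))  # activation is copied across GPUs here
        return x
```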