These are notes on the ZeRO memory-optimization paper.
Data Parallelism (DP) and Model Parallelism (MP) are two ways to parallelize training.
| | Data Parallel (DP) | Model Parallel (MP) |
|---|---|---|
| Split | Data is split across GPUs; each GPU holds a full copy of the model. | The model is split across GPUs; each GPU processes the same data. |
| Gradient update | Gradients are averaged across GPUs, and every replica applies the same update. | Each GPU computes and applies gradients only for its own partition of the model. |
| Memory efficiency | Low. Does not reduce the memory occupied by the model; runs out of memory for models beyond ~1.4B parameters on 32 GB GPUs. | High. Each GPU holds only a fraction of the model. |
| Compute/communication efficiency | High. | Low. Requires heavy inter-GPU communication, so it scales well only within a single node where GPU-to-GPU bandwidth is high. |
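
A minimal sketch of the data-parallel gradient update (my own illustration, not code from the paper): each rank computes gradients on its shard of the batch, then all ranks average gradients with an all-reduce so every replica applies the identical update. Assumes `torch.distributed` has been initialized with one process per GPU.

```python
import torch
import torch.distributed as dist

def average_gradients(model: torch.nn.Module) -> None:
    """Average each parameter's gradient across all data-parallel ranks."""
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            # Sum the gradient tensors from every GPU, then divide to get the mean.
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size

# Typical training step on each rank (hypothetical names):
#   loss = compute_loss(model, batch_shard)  # each GPU sees a different shard of the batch
#   loss.backward()                          # local gradients
#   average_gradients(model)                 # all-reduce: every replica now has identical gradients
#   optimizer.step()                         # every replica applies the same update
```

For contrast, a toy sketch of (inter-layer) model parallelism under the same caveats: different layers live on different GPUs, and activations are moved between devices during the forward pass, which is why the approach depends on fast inter-GPU links.

```python
import torch
import torch.nn as nn

class TwoStageModel(nn.Module):
    """Hypothetical two-GPU split: first half of the model on cuda:0, second half on cuda:1."""
    def __init__(self) -> None:
        super().__init__()
        self.stage0 = nn.Linear(1024, 4096).to("cuda:0")
        self.stage1 = nn.Linear(4096, 1024).to("cuda:1")

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = torch.relu(self.stage0(x.to("cuda:0")))
        x = self.stage1(x.to("cuda:1"))  # activation is copied across GPUs here
        return x
```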