# Efficient Training on Multiple GPUs
Note: most of the strategies introduced in the single GPU section are generic and apply to training models in general, so have a look at it before diving into the following sections.

Below is a rough outline of which parallelism strategy to use when training on a single node with multiple GPUs:
* Model fits onto a single GPU:

    1. DDP - Distributed Data Parallel (see the sketch after this list)
    2. ZeRO - may or may not be faster depending on the situation and configuration used
* Model doesn't fit onto a single GPU:

    1. PP - Pipeline Parallelism
    2. ZeRO - Zero Redundancy Optimizer (a minimal config sketch follows this list)
    3. TP - Tensor Parallelism
    With very fast intra-node connectivity such as NVLink or NVSwitch, all three should be mostly on par; without it, PP will be faster than TP or ZeRO. The degree of TP may also make a difference. It's best to experiment to find the winner on your particular setup.
    TP is almost always used within a single node, that is, TP size <= GPUs per node (see the device-mesh sketch after this list).
* Largest layer not fitting into a single GPU:

    1. If not using ZeRO, you must use TP, since PP alone won't be able to fit such a large layer.
    2. With ZeRO, see the same entry under "Single GPU" above.
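For the "model fits onto a single GPU" case, here is a minimal PyTorch DDP sketch. The `torch.nn.Linear` model and the random batch are placeholders standing in for your real model and data:

```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# torchrun sets RANK, LOCAL_RANK and WORLD_SIZE for every process it spawns.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# Placeholder model; each GPU holds a full replica of the weights.
model = torch.nn.Linear(1024, 1024).to(local_rank)
model = DDP(model, device_ids=[local_rank])
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# One training step on a random batch; DDP all-reduces the gradients
# across all GPUs during backward().
inputs = torch.randn(8, 1024, device=local_rank)
loss = model(inputs).pow(2).mean()
loss.backward()
optimizer.step()

dist.destroy_process_group()
```

Launched with e.g. `torchrun --nproc_per_node=8 train_ddp.py`, this runs one full model replica per GPU and keeps them in sync through gradient all-reduce.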
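For the ZeRO option, a minimal sketch of enabling DeepSpeed ZeRO stage 3 through the 🤗 `Trainer`. `model` and `train_dataset` are assumed to be defined elsewhere, and the config values are illustrative rather than tuned:

```python
from transformers import Trainer, TrainingArguments

# Illustrative ZeRO stage 3 config; "auto" lets the Trainer fill in
# values that have to match its own arguments.
ds_config = {
    "zero_optimization": {"stage": 3},
    "bf16": {"enabled": "auto"},
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}

training_args = TrainingArguments(
    output_dir="output",
    per_device_train_batch_size=1,
    bf16=True,
    deepspeed=ds_config,  # a path to a JSON config file also works
)

# `model` and `train_dataset` are assumed to exist already.
trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
trainer.train()
```

Stage 3 shards parameters, gradients and optimizer states across the GPUs, which is what lets a model that doesn't fit on one GPU train at all.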
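To make the "TP size <= GPUs per node" constraint concrete, here is a sketch using PyTorch's `DeviceMesh` to build a 2-D mesh in which the TP dimension stays inside a node; the per-node process count comes from `LOCAL_WORLD_SIZE`, which `torchrun` sets:

```python
import os

import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh

dist.init_process_group(backend="nccl")
world_size = dist.get_world_size()

# Keep TP inside a node: TP size = GPUs per node, DP spans the nodes.
tp_size = int(os.environ["LOCAL_WORLD_SIZE"])
dp_size = world_size // tp_size

# torchrun assigns consecutive ranks to the same node, so the innermost
# ("tp") mesh dimension groups GPUs that share NVLink/NVSwitch, while the
# slower inter-node network only carries the DP gradient all-reduces.
mesh = init_device_mesh("cuda", (dp_size, tp_size), mesh_dim_names=("dp", "tp"))
tp_group = mesh["tp"].get_group()
```

The resulting `tp_group` is the process group you would hand to your tensor-parallel layers, keeping their frequent all-reduces on the fast intra-node links.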