# Efficient Training on Multiple GPUs
Note: most of the strategies introduced in the single GPU section are generic and apply to training models in general, so have a look at it before diving into the following sections.

Below is a rough outline of which parallelism strategy to use when training on a single node with multiple GPUs:
* Model fits onto a single GPU:

    1. DDP - Distributed Data Parallel (see the sketch after this list)
    2. ZeRO - may or may not be faster depending on the situation and configuration used
* Model doesn't fit onto a single GPU:

    1. PP - Pipeline Parallelism
    2. ZeRO - Zero Redundancy Optimizer (a minimal config sketch follows this list)
    3. TP - Tensor Parallelism
    With very fast intra-node connectivity such as NVLink or NVSwitch, all three should be mostly on par; without it, PP will be faster than TP or ZeRO. The degree of TP may also make a difference. It's best to experiment to find the winner on your particular setup.
    TP is almost always used within a single node, that is, TP size <= GPUs per node (see the device-mesh sketch after this list).
* Largest layer not fitting into a single GPU:

    1. If not using ZeRO, you must use TP, since PP alone won't be able to fit such a large layer.
    2. With ZeRO, see the same entry under "Single GPU" above.
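For the "model fits onto a single GPU" case, here is a minimal PyTorch DDP sketch. The `torch.nn.Linear` model and the random batch are placeholders standing in for your real model and data:

```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# torchrun sets RANK, LOCAL_RANK and WORLD_SIZE for every process it spawns.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# Placeholder model; each GPU holds a full replica of the weights.
model = torch.nn.Linear(1024, 1024).to(local_rank)
model = DDP(model, device_ids=[local_rank])
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# One training step on a random batch; DDP all-reduces the gradients
# across all GPUs during backward().
inputs = torch.randn(8, 1024, device=local_rank)
loss = model(inputs).pow(2).mean()
loss.backward()
optimizer.step()

dist.destroy_process_group()
```

Launched with e.g. `torchrun --nproc_per_node=8 train_ddp.py`, this runs one full model replica per GPU and keeps them in sync through gradient all-reduce.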
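For the ZeRO option, a minimal sketch of enabling DeepSpeed ZeRO stage 3 through the 🤗 `Trainer`. `model` and `train_dataset` are assumed to be defined elsewhere, and the config values are illustrative rather than tuned:

```python
from transformers import Trainer, TrainingArguments

# Illustrative ZeRO stage 3 config; "auto" lets the Trainer fill in
# values that have to match its own arguments.
ds_config = {
    "zero_optimization": {"stage": 3},
    "bf16": {"enabled": "auto"},
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}

training_args = TrainingArguments(
    output_dir="output",
    per_device_train_batch_size=1,
    bf16=True,
    deepspeed=ds_config,  # a path to a JSON config file also works
)

# `model` and `train_dataset` are assumed to exist already.
trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
trainer.train()
```

Stage 3 shards parameters, gradients and optimizer states across the GPUs, which is what lets a model that doesn't fit on one GPU train at all.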
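To make the "TP size <= GPUs per node" constraint concrete, here is a sketch using PyTorch's `DeviceMesh` to build a 2-D mesh in which the TP dimension stays inside a node; the per-node process count comes from `LOCAL_WORLD_SIZE`, which `torchrun` sets:

```python
import os

import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh

dist.init_process_group(backend="nccl")
world_size = dist.get_world_size()

# Keep TP inside a node: TP size = GPUs per node, DP spans the nodes.
tp_size = int(os.environ["LOCAL_WORLD_SIZE"])
dp_size = world_size // tp_size

# torchrun assigns consecutive ranks to the same node, so the innermost
# ("tp") mesh dimension groups GPUs that share NVLink/NVSwitch, while the
# slower inter-node network only carries the DP gradient all-reduces.
mesh = init_device_mesh("cuda", (dp_size, tp_size), mesh_dim_names=("dp", "tp"))
tp_group = mesh["tp"].get_group()
```

The resulting `tp_group` is the process group you would hand to your tensor-parallel layers, keeping their frequent all-reduces on the fast intra-node links.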