Libraries
MPI: efficient CPU allreduce
dmlc/rabit: fault-tolerant variant
facebookincubator/gloo
Parameter Hub (PHub): from University of Washington
NCCL: Nvidia's efficient multi-GPU collective
ps-lite: DMLC's lightweight parameter server implementation
Technologies behind Distributed Deep Learning: AllReduce
Interface: result = allreduce(float buffer[size])
grad = gradient(net, w)                      # build the gradient graph once
for epoch, data in enumerate(dataset):
    g = net.run(grad, in=data)               # local gradient on this worker's shard
    gsum = comm.allreduce(g, op=sum)         # sum gradients across all workers
    w -= lr * gsum / num_workers             # every worker applies the same update
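A minimal in-process sketch of the loop above, with the allreduce simulated by summing the workers' buffers directly (the worker count, gradient values, and learning rate are made-up illustration values, not from any particular framework):

```python
def allreduce(buffers, op=sum):
    """Sum-allreduce: every worker ends up holding the elementwise reduction."""
    total = [op(vals) for vals in zip(*buffers)]
    return [list(total) for _ in buffers]

num_workers = 4
lr = 0.1
w = [1.0, -2.0]  # model weights, replicated on every worker

# Each worker computes a gradient on its own shard of the data.
local_grads = [[0.1 * (k + 1), -0.2 * (k + 1)] for k in range(num_workers)]

reduced = allreduce(local_grads)  # all workers now hold the same sum
gsum = reduced[0]
w = [wi - lr * gi / num_workers for wi, gi in zip(w, gsum)]
```

Because every worker receives the identical `gsum`, the replicated weights stay in sync without any central server.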

time complexity (common ring algorithm; p workers, n-element buffer):
    2(p-1) communication steps: p-1 for reduce-scatter, p-1 for allgather
    each worker sends about 2n(p-1)/p elements, i.e. roughly 2n, nearly independent of p

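Those step counts come from the ring algorithm's two phases: a reduce-scatter that leaves each worker holding one fully summed chunk, then an allgather that circulates the finished chunks. A small simulation of that pattern (illustrative only, not how NCCL or MPI actually implement it):

```python
def ring_allreduce(inputs):
    """Simulate ring allreduce: reduce-scatter then allgather, p-1 steps each."""
    p = len(inputs)
    n = len(inputs[0])
    assert n % p == 0, "real implementations pad so the buffer splits evenly"
    c = n // p
    bufs = [list(b) for b in inputs]

    def seg(i):  # slice bounds of chunk i (mod p)
        i %= p
        return slice(i * c, (i + 1) * c)

    # Phase 1: reduce-scatter. At each step, worker r sends one chunk to
    # its ring neighbour r+1, which adds it into its own copy of that chunk.
    for t in range(p - 1):
        msgs = [(r, (r - t) % p, bufs[r][seg(r - t)]) for r in range(p)]
        for r, idx, data in msgs:
            s = seg(idx)
            dst = bufs[(r + 1) % p]
            dst[s] = [a + b for a, b in zip(dst[s], data)]

    # Phase 2: allgather. Each fully reduced chunk circulates unchanged
    # around the ring until every worker holds every chunk.
    for t in range(p - 1):
        msgs = [(r, (r + 1 - t) % p, bufs[r][seg(r + 1 - t)]) for r in range(p)]
        for r, idx, data in msgs:
            bufs[(r + 1) % p][seg(idx)] = data
    return bufs

out = ring_allreduce([[float(r)] * 4 for r in range(4)])
# every worker ends up with the elementwise sum [6.0, 6.0, 6.0, 6.0]
```

Each worker only ever talks to its ring neighbour and sends one chunk (n/p elements) per step, which is why the per-worker traffic stays near 2n regardless of p.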
Parameter Server
Interface: key-value store
    ps.push(index, gradient) and ps.pull(index)
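A toy single-process sketch of that push/pull interface (the class, key names, and server-side SGD update rule here are illustrative assumptions, not ps-lite's actual API):

```python
class ParamServer:
    """Toy key-value parameter server: workers push gradients, pull weights."""

    def __init__(self, lr=0.1):
        self.lr = lr
        self.store = {}  # index -> parameter vector

    def init(self, index, value):
        self.store[index] = list(value)

    def push(self, index, gradient):
        # The server applies the SGD update as each gradient arrives.
        w = self.store[index]
        for i, g in enumerate(gradient):
            w[i] -= self.lr * g

    def pull(self, index):
        return list(self.store[index])

ps = ParamServer(lr=0.1)
ps.init("layer0/w", [1.0, 1.0])
ps.push("layer0/w", [0.5, -0.5])  # a worker pushes its gradient
w = ps.pull("layer0/w")           # workers pull the updated weights
```

Unlike allreduce, workers here never talk to each other: all traffic goes through the server, which makes asynchronous updates easy but puts the server's bandwidth on the critical path.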