Notes from Mu Li's Talk
Challenges
- Massive communication traffic, and limited communication bandwidth (10x less than memory bandwidth)
- Large synchronization cost (~1 ms latency)
- Job failures
Scaling Distributed Machine Learning
Distributed Systems
- Large data size, complex models
- Fault tolerant
- Easy to use
Large-Scale Optimization
- Communication efficient (reduces communication overhead)
- Convergence guarantee
Methods: Systems
- Parameter Server for machine learning
- MXNet for deep learning
Methods: Optimization Algorithms
- DBPG for non-convex non-smooth f_i
- EMSO for efficient minibatch SGD
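The notes don't spell out EMSO itself, but the minibatch-SGD setting it targets can be sketched as follows. This is a plain minibatch SGD loop on least squares, not EMSO; it only illustrates the tradeoff the talk points at: a larger batch size means fewer parameter synchronizations per epoch (each update would be one push/pull round in a distributed run), at the cost of slower statistical convergence.

```python
import numpy as np

# Minimal minibatch SGD sketch on least squares (illustrative only, not EMSO).
def minibatch_sgd(X, y, batch_size=32, lr=0.1, epochs=50, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        idx = rng.permutation(n)
        for start in range(0, n, batch_size):
            b = idx[start:start + batch_size]
            # Gradient of 0.5 * ||X_b w - y_b||^2 / |b|
            grad = X[b].T @ (X[b] @ w - y[b]) / len(b)
            # In a distributed run, this update is one communication round.
            w -= lr * grad
    return w

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true
w = minibatch_sgd(X, y)
```

Doubling `batch_size` halves the number of updates (and hence synchronizations) per epoch, which is why high communication cost pushes toward larger batches and better batch-friendly algorithms.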
With appropriate computational frameworks and algorithm design, distributed machine learning can be made simple, fast, and scalable, both in theory and in practice.
The core idea is co-design: consider the algorithm and the system together. With enough system support, the algorithm can be simpler — for example, even if MXNet does brute-force synchronous communication, the system can still parallelize it automatically. Conversely, when the communication-to-computation ratio is high, larger batch sizes are needed, which in turn requires better algorithms.
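The "system parallelizes automatically" point is the kind of communication/computation overlap a dependency engine provides: independent operations run concurrently even if the user wrote them sequentially. A toy sketch with threads (the function names are illustrative stand-ins, not MXNet's API):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def push_gradients(layer):
    # Stand-in for network communication (~50 ms).
    time.sleep(0.05)
    return f"pushed layer {layer}"

def backprop(layer):
    # Stand-in for computing the next layer's gradient (~50 ms).
    time.sleep(0.05)
    return f"grad layer {layer}"

with ThreadPoolExecutor(max_workers=2) as pool:
    start = time.time()
    # Push layer 2's gradient while still computing layer 1's:
    fut = pool.submit(push_gradients, 2)
    g1 = backprop(1)
    fut.result()
    elapsed = time.time() - start
# Overlapped, the two ~50 ms steps take ~50 ms instead of ~100 ms.
```

A real dependency engine discovers this overlap from the data-dependency graph, so the user's "brute-force synchronous" code still gets the parallelism.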
Parameter Server
Existing Open Source Systems in 2012
- MPI (message passing interface)
- Hard to use for sparse problems
- No fault tolerance
- Key-value stores, e.g. Redis
- Expensive individual key-value pair communication
- Difficult to program on the server side
- Hadoop/Spark
- BSP (bulk synchronous parallel) data consistency makes efficient implementation challenging
    - Every iteration requires a global synchronization barrier, which is inflexible
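Against these systems, the parameter server's basic interface is batched push/pull on parameter blocks rather than per-key messages. A minimal in-memory sketch (illustrative only, not the ps-lite or MXNet KVStore API): workers push gradient vectors keyed by parameter block, the server applies them, and workers pull updated weights. Communicating whole vectors avoids the per-key overhead of a generic key-value store.

```python
import numpy as np

class ParameterServer:
    """Toy single-process parameter server (illustrative sketch)."""

    def __init__(self, lr=0.1):
        self.weights = {}
        self.lr = lr

    def push(self, key, grad):
        # Apply a worker's gradient to the parameter block in place.
        w = self.weights.setdefault(key, np.zeros_like(grad))
        w -= self.lr * grad

    def pull(self, key):
        # Return a copy of the current weights for the block.
        return self.weights[key].copy()

server = ParameterServer(lr=0.1)
server.push("w", np.array([1.0, 2.0]))  # worker 1's gradient
server.push("w", np.array([3.0, 4.0]))  # worker 2's gradient
w = server.pull("w")                    # -> array([-0.4, -0.6])
```

In the real system the key space is sharded across many server nodes, pushes are aggregated, and consistency (BSP, asynchronous, or bounded delay) is a tunable knob rather than fixed as in Hadoop/Spark.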
Architecture