Notes from Mu Li's Talk
Challenges
- Massive communication traffic, and limited communication bandwidth (10x less than memory bandwidth)
- Large synchronization cost (~1 ms latency)
- Job failures
Scaling Distributed Machine Learning
Distributed Systems
- Large data size, complex models
- Fault tolerant
- Easy to use
Large-Scale Optimization
- Communication efficient (reduces communication overhead)
- Convergence guarantee
Methods: Systems
- Parameter Server for machine learning
- MXNet for deep learning
Methods: Optimization Algorithms
- DBPG for non-convex non-smooth f_i
- EMSO for efficient minibatch SGD
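The notes don't spell out EMSO itself, but the minibatch-SGD setting it targets can be sketched as follows. This is a plain minibatch SGD loop on least squares, not EMSO; it only illustrates the tradeoff the talk points at: a larger batch size means fewer parameter synchronizations per epoch (each update would be one push/pull round in a distributed run), at the cost of slower statistical convergence.

```python
import numpy as np

# Minimal minibatch SGD sketch on least squares (illustrative only, not EMSO).
def minibatch_sgd(X, y, batch_size=32, lr=0.1, epochs=50, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        idx = rng.permutation(n)
        for start in range(0, n, batch_size):
            b = idx[start:start + batch_size]
            # Gradient of 0.5 * ||X_b w - y_b||^2 / |b|
            grad = X[b].T @ (X[b] @ w - y[b]) / len(b)
            # In a distributed run, this update is one communication round.
            w -= lr * grad
    return w

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true
w = minibatch_sgd(X, y)
```

Doubling `batch_size` halves the number of updates (and hence synchronizations) per epoch, which is why high communication cost pushes toward larger batches and better batch-friendly algorithms.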
With appropriate computational frameworks and algorithm design, distributed machine learning can be made simple, fast, and scalable, both in theory and in practice.
The core idea is co-design: consider the algorithm and the system together. With enough system support, the algorithm can be simpler — for example, even if MXNet does brute-force synchronous communication, the system can still parallelize it automatically. Conversely, when the communication-to-computation ratio is high, larger batch sizes are needed, which in turn requires better algorithms.
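The "system parallelizes automatically" point is the kind of communication/computation overlap a dependency engine provides: independent operations run concurrently even if the user wrote them sequentially. A toy sketch with threads (the function names are illustrative stand-ins, not MXNet's API):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def push_gradients(layer):
    # Stand-in for network communication (~50 ms).
    time.sleep(0.05)
    return f"pushed layer {layer}"

def backprop(layer):
    # Stand-in for computing the next layer's gradient (~50 ms).
    time.sleep(0.05)
    return f"grad layer {layer}"

with ThreadPoolExecutor(max_workers=2) as pool:
    start = time.time()
    # Push layer 2's gradient while still computing layer 1's:
    fut = pool.submit(push_gradients, 2)
    g1 = backprop(1)
    fut.result()
    elapsed = time.time() - start
# Overlapped, the two ~50 ms steps take ~50 ms instead of ~100 ms.
```

A real dependency engine discovers this overlap from the data-dependency graph, so the user's "brute-force synchronous" code still gets the parallelism.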
Parameter Server
Existing Open Source Systems in 2012
- MPI (message passing interface)
- Hard to use for sparse problems
- No fault tolerance
- Key-value stores, e.g. Redis
- Expensive individual key-value pair communication
- Difficult to program on the server side
- Hadoop/Spark
- BSP (bulk synchronous parallel) data consistency makes efficient implementation challenging
    - Every iteration requires a global synchronization barrier, which is inflexible
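Against these systems, the parameter server's basic interface is batched push/pull on parameter blocks rather than per-key messages. A minimal in-memory sketch (illustrative only, not the ps-lite or MXNet KVStore API): workers push gradient vectors keyed by parameter block, the server applies them, and workers pull updated weights. Communicating whole vectors avoids the per-key overhead of a generic key-value store.

```python
import numpy as np

class ParameterServer:
    """Toy single-process parameter server (illustrative sketch)."""

    def __init__(self, lr=0.1):
        self.weights = {}
        self.lr = lr

    def push(self, key, grad):
        # Apply a worker's gradient to the parameter block in place.
        w = self.weights.setdefault(key, np.zeros_like(grad))
        w -= self.lr * grad

    def pull(self, key):
        # Return a copy of the current weights for the block.
        return self.weights[key].copy()

server = ParameterServer(lr=0.1)
server.push("w", np.array([1.0, 2.0]))  # worker 1's gradient
server.push("w", np.array([3.0, 4.0]))  # worker 2's gradient
w = server.pull("w")                    # -> array([-0.4, -0.6])
```

In the real system the key space is sharded across many server nodes, pushes are aggregated, and consistency (BSP, asynchronous, or bounded delay) is a tunable knob rather than fixed as in Hadoop/Spark.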
Architecture