Introduction

Problems

  1. Which models should be co-located on each GPU, and how should they be scheduled to minimize mutual interference?
  2. Although batching increases latency, the added latency usually stays within application-level latency bounds.
  3. Batching is only feasible for requests to the same model.
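The batching constraints above can be sketched as follows. This is a hypothetical illustration, not the paper's implementation: requests are grouped into per-model queues (point 3), and a batch is flushed either when full or when the oldest request has waited long enough that further batching would risk the latency bound (point 2). The names `MAX_BATCH` and `MAX_WAIT_MS` are assumed parameters.

```python
import time
from collections import defaultdict

MAX_BATCH = 8       # assumed maximum batch size
MAX_WAIT_MS = 5.0   # assumed batching window within the latency bound

class Batcher:
    def __init__(self):
        # one queue per model: batching across different models is not feasible
        self.queues = defaultdict(list)

    def submit(self, model, request):
        # record arrival time so we can enforce the waiting bound
        self.queues[model].append((time.monotonic(), request))

    def ready_batches(self):
        """Yield (model, batch) pairs that are full or have waited too long."""
        now = time.monotonic()
        for model, q in self.queues.items():
            if not q:
                continue
            oldest_wait_ms = (now - q[0][0]) * 1000
            if len(q) >= MAX_BATCH or oldest_wait_ms >= MAX_WAIT_MS:
                batch = [r for _, r in q[:MAX_BATCH]]
                del q[:MAX_BATCH]
                yield model, batch

b = Batcher()
for i in range(10):
    b.submit("resnet50", i)
batches = list(b.ready_batches())  # one full batch of 8; 2 requests keep waiting
```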

Related

Large-scale short task serving problems

Distributed frontends that dispatch low-latency tasks to queues on the backend servers.

Both assume that backend server allocation and task placement are performed at a higher level, using cluster resource managers such as Mesos or Omega.
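A minimal sketch of the frontend/backend split described above, under assumed names: the frontend only dispatches low-latency tasks into per-backend queues (here, least-loaded first), while the choice of which backends exist is left to a higher-level cluster resource manager such as Mesos or Omega.

```python
import heapq

class Frontend:
    """Hypothetical frontend that dispatches tasks to backend queues."""

    def __init__(self, backends):
        # min-heap of (queue_length, backend_id) for least-loaded dispatch;
        # the backend list itself comes from the cluster resource manager
        self.load = [(0, b) for b in backends]
        heapq.heapify(self.load)
        self.queues = {b: [] for b in backends}

    def dispatch(self, task):
        # place the task on the currently shortest backend queue
        qlen, backend = heapq.heappop(self.load)
        self.queues[backend].append(task)
        heapq.heappush(self.load, (qlen + 1, backend))
        return backend

fe = Frontend(["gpu0", "gpu1"])
placements = [fe.dispatch(t) for t in range(4)]
```

Least-loaded dispatch is only one plausible policy; the point is that the frontend makes fast per-task placement decisions against queues, not cluster-wide allocation decisions.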

Technical Details

Overview