Introduction
- Fundamental problem: distribute a large incoming workload across a cluster of accelerators at high accelerator utilization and acceptable latency. (sharding inputs through a distributed frontend onto DNNs running on "backend" GPUs)
- Factors
  - Place different networks on the same GPU. Should schedule them to maximize their combined throughput while satisfying latency bounds.
  - Applications consist of groups of DNNs that feed into each other. Should schedule the execution of the entire group.
  - Batching improves efficiency. It
    a. benefits from cross-tenant and cross-request coordination
    b. forces the underlying bin-packing-based algorithms to incorporate batch size
  - Common use of transfer learning has led to specialization of networks and loss of batching benefits (two tasks use networks that are only mostly identical).
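The last point above can be illustrated with a toy sketch (my own illustration, not the paper's implementation): two specialized models share a common backbone "prefix" and differ only in their final heads, so batching the shared prefix recovers the batching benefit that specialization would otherwise lose. The function names and the arithmetic stand-ins for layers are all hypothetical.

```python
def backbone(x):
    # Stand-in for the shared pretrained layers (the common prefix).
    return x * 2

def head_a(feat):
    # Task-A-specific final layers.
    return feat + 1

def head_b(feat):
    # Task-B-specific final layers.
    return feat - 1

def serve_batched(requests):
    """requests: list of (task, input) pairs across both tasks."""
    # One batched pass through the shared prefix for all tasks...
    feats = [backbone(x) for _, x in requests]
    # ...then split the batch and run each task's small head separately.
    heads = {"a": head_a, "b": head_b}
    return [heads[task](f) for (task, _), f in zip(requests, feats)]

print(serve_batched([("a", 3), ("b", 3)]))  # [7, 5]
```

Without the shared-prefix trick, the two tasks' requests could not be batched at all, since batching normally requires identical networks.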
- Techniques
- Novel batching-aware scheduler that performs bin packing with variable-size items. Specifies the # of GPUs, the distribution of DNNs across them, and their execution order.
- Allows groups of related DNN invocations to be written as queries and provides automated complex query scheduling to assign optimal batch sizes.
- Allows batching of parts of networks with different batch sizes.
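A minimal sketch of what batching-aware bin packing might look like (my own illustration, not the paper's algorithm; the linear latency model `lat(b) = α + β·b` and all constants are assumptions). The key point: each model's GPU occupancy depends on its chosen batch size, so the "item size" being packed is variable rather than fixed.

```python
def max_batch(alpha, beta, slo_ms):
    # Largest batch whose execution latency still fits the latency bound,
    # assuming lat(b) = alpha + beta * b (milliseconds).
    return max(int((slo_ms - alpha) // beta), 1)

def occupancy(rate_rps, alpha, beta, slo_ms):
    # Fraction of one GPU needed: offered load / throughput at chosen batch.
    b = max_batch(alpha, beta, slo_ms)
    throughput_rps = b / ((alpha + beta * b) / 1000.0)
    return rate_rps / throughput_rps

def first_fit_decreasing(models):
    # models: list of (name, rate_rps, alpha, beta, slo_ms).
    items = sorted(((occupancy(r, a, b, s), n) for n, r, a, b, s in models),
                   reverse=True)
    gpus = []  # each GPU: [remaining_capacity, [model names]]
    for size, name in items:
        for gpu in gpus:
            if gpu[0] >= size:          # first GPU with room wins
                gpu[0] -= size
                gpu[1].append(name)
                break
        else:                           # no GPU fits: open a new one
            gpus.append([1.0 - size, [name]])
    return gpus
```

For example, two models at 100 requests/s each, with α = 5 ms, β = 2 ms, and a 50 ms SLO, each occupy about 22% of a GPU, so first-fit-decreasing co-locates both on a single GPU.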
- Features
- Scale: Automates resource (GPU) allocation and model scheduling across resources. (distributed frontend, work sharding)
- Expressivity: Query mechanism that (a) allows grouping of related DNN execution tasks, and (b) allows the user to specify the latency SLO at the query level.
- Granularity: Identifies common subgraphs and executes them in batch.
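A hypothetical sketch of the query idea (names and the proportional-split policy are my assumptions, not the system's API): a query groups dependent DNN invocations and carries a single query-level SLO, which the scheduler must then divide across the stages.

```python
from dataclasses import dataclass

@dataclass
class Stage:
    model: str        # model to invoke at this step of the pipeline
    est_ms: float     # estimated per-stage execution time

@dataclass
class Query:
    stages: list      # pipeline of dependent DNN invocations
    slo_ms: float     # latency bound for the whole query

def split_slo(query):
    # Divide the query-level SLO across stages in proportion to their
    # estimated cost (a simple proportional policy, for illustration).
    total = sum(s.est_ms for s in query.stages)
    return {s.model: query.slo_ms * s.est_ms / total for s in query.stages}

q = Query(stages=[Stage("detector", 30.0), Stage("recognizer", 10.0)],
          slo_ms=100.0)
print(split_slo(q))  # {'detector': 75.0, 'recognizer': 25.0}
```

The per-stage budgets then feed back into per-model batch-size selection, since each stage's admissible batch size depends on its share of the SLO.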
Problems
- Which models should be co-located on each GPU, and how should they be scheduled to minimize mutual interference?
- Although batching increases latency, it usually stays within application-level bounds.
- Variable batch size complicates system design.
  - the resource quantum consumed by each input is "squishy"
  - latency depends on batch size
  - batching is only feasible for requests to the same model
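The "squishy" quantum in the list above can be made concrete with assumed numbers (α = 8 ms, β = 2 ms and the linear model `lat(b) = α + β·b` are illustrative, not from the paper): the GPU time charged per input, `lat(b)/b`, shrinks as the batch grows, while the batch's end-to-end latency rises.

```python
def per_input_ms(b, alpha=8.0, beta=2.0):
    # GPU time per input under an assumed linear batch latency model.
    return (alpha + beta * b) / b

print(per_input_ms(1))  # 10.0  (batch of 1: full fixed cost per input)
print(per_input_ms(8))  # 3.0   (batch of 8: fixed cost amortized)
```

This is why the bin-packing items have no fixed size: the same input costs less GPU time when batched more aggressively, but only if the larger batch still meets the latency bound.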
Related
- Clipper: Selects the model to serve each request, batches requests with adaptive batch sizes, and forwards the batched requests to a backend container.
- TensorFlow Serving: Does not provide adaptive batching and caching, but has machinery for versioning models.
Large-scale short task serving problems
Distributed frontends that dispatch low-latency tasks to queues on the backend servers.
- Sparrow: focuses on dispatch strategies to reduce the delays associated with queuing in such systems.
- Slicer: provides a fast, fault-tolerant service for dividing the backend into shards and load balancing across them.
Both assume that backend server allocation and task placement are performed at a higher level, using cluster resource managers such as Mesos or Omega.
Technical details
Overview