Introduction
- Fundamental problem: distribute a large incoming workload across a cluster of accelerators at high accelerator utilization and acceptable latency. (sharding inputs through a distributed frontend onto DNNs running on "backend" GPUs)
- Factors
  - Place different networks on the same GPU. Should schedule them to maximize their combined throughput while satisfying latency bounds.
  - Applications consist of groups of DNNs that feed into each other. Should schedule the execution of the entire group.
  - Batching improves efficiency. It
    a. benefits from cross-tenant and cross-request coordination
    b. forces the underlying bin-packing-based algorithms to incorporate batch size
  - Common use of transfer learning has led to specialization of networks and loss of batching benefits (two tasks use networks that are only mostly identical).
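The last point above can be illustrated with a toy sketch (my own illustration, not the paper's implementation): two specialized models share a common backbone "prefix" and differ only in their final heads, so batching the shared prefix recovers the batching benefit that specialization would otherwise lose. The function names and the arithmetic stand-ins for layers are all hypothetical.

```python
def backbone(x):
    # Stand-in for the shared pretrained layers (the common prefix).
    return x * 2

def head_a(feat):
    # Task-A-specific final layers.
    return feat + 1

def head_b(feat):
    # Task-B-specific final layers.
    return feat - 1

def serve_batched(requests):
    """requests: list of (task, input) pairs across both tasks."""
    # One batched pass through the shared prefix for all tasks...
    feats = [backbone(x) for _, x in requests]
    # ...then split the batch and run each task's small head separately.
    heads = {"a": head_a, "b": head_b}
    return [heads[task](f) for (task, _), f in zip(requests, feats)]

print(serve_batched([("a", 3), ("b", 3)]))  # [7, 5]
```

Without the shared-prefix trick, the two tasks' requests could not be batched at all, since batching normally requires identical networks.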
- Techniques
- Novel batching-aware scheduler that performs bin packing with variable-size items. Specifies the # of GPUs, the distribution of DNNs across them, and their execution order.
- Allows groups of related DNN invocations to be written as queries and provides automated complex query scheduling to assign optimal batch sizes.
- Allows batching of parts of networks with different batch sizes.
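A minimal sketch of what batching-aware bin packing might look like (my own illustration, not the paper's algorithm; the linear latency model `lat(b) = α + β·b` and all constants are assumptions). The key point: each model's GPU occupancy depends on its chosen batch size, so the "item size" being packed is variable rather than fixed.

```python
def max_batch(alpha, beta, slo_ms):
    # Largest batch whose execution latency still fits the latency bound,
    # assuming lat(b) = alpha + beta * b (milliseconds).
    return max(int((slo_ms - alpha) // beta), 1)

def occupancy(rate_rps, alpha, beta, slo_ms):
    # Fraction of one GPU needed: offered load / throughput at chosen batch.
    b = max_batch(alpha, beta, slo_ms)
    throughput_rps = b / ((alpha + beta * b) / 1000.0)
    return rate_rps / throughput_rps

def first_fit_decreasing(models):
    # models: list of (name, rate_rps, alpha, beta, slo_ms).
    items = sorted(((occupancy(r, a, b, s), n) for n, r, a, b, s in models),
                   reverse=True)
    gpus = []  # each GPU: [remaining_capacity, [model names]]
    for size, name in items:
        for gpu in gpus:
            if gpu[0] >= size:          # first GPU with room wins
                gpu[0] -= size
                gpu[1].append(name)
                break
        else:                           # no GPU fits: open a new one
            gpus.append([1.0 - size, [name]])
    return gpus
```

For example, two models at 100 requests/s each, with α = 5 ms, β = 2 ms, and a 50 ms SLO, each occupy about 22% of a GPU, so first-fit-decreasing co-locates both on a single GPU.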
- Features
- Scale: Automates resource (GPU) allocation and model scheduling across resources. (distributed frontend, work sharding)
- Expressivity: Query mechanism that (a) allows grouping of related DNN execution tasks, and (b) allows the user to specify the latency SLO at the query level.
- Granularity: Identifies common subgraphs and executes them in batch.
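A hypothetical sketch of the query idea (names and the proportional-split policy are my assumptions, not the system's API): a query groups dependent DNN invocations and carries a single query-level SLO, which the scheduler must then divide across the stages.

```python
from dataclasses import dataclass

@dataclass
class Stage:
    model: str        # model to invoke at this step of the pipeline
    est_ms: float     # estimated per-stage execution time

@dataclass
class Query:
    stages: list      # pipeline of dependent DNN invocations
    slo_ms: float     # latency bound for the whole query

def split_slo(query):
    # Divide the query-level SLO across stages in proportion to their
    # estimated cost (a simple proportional policy, for illustration).
    total = sum(s.est_ms for s in query.stages)
    return {s.model: query.slo_ms * s.est_ms / total for s in query.stages}

q = Query(stages=[Stage("detector", 30.0), Stage("recognizer", 10.0)],
          slo_ms=100.0)
print(split_slo(q))  # {'detector': 75.0, 'recognizer': 25.0}
```

The per-stage budgets then feed back into per-model batch-size selection, since each stage's admissible batch size depends on its share of the SLO.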
Problems
- Which models should be co-located on each GPU, and how should they be scheduled to minimize mutual interference?
- Although batching increases latency, it usually stays within application-level bounds.
- Variable batch size complicates system design.
  - the resource quantum consumed by each input is "squishy"
  - latency depends on batch size
  - batching is only feasible for requests to the same model
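The "squishy" quantum in the list above can be made concrete with assumed numbers (α = 8 ms, β = 2 ms and the linear model `lat(b) = α + β·b` are illustrative, not from the paper): the GPU time charged per input, `lat(b)/b`, shrinks as the batch grows, while the batch's end-to-end latency rises.

```python
def per_input_ms(b, alpha=8.0, beta=2.0):
    # GPU time per input under an assumed linear batch latency model.
    return (alpha + beta * b) / b

print(per_input_ms(1))  # 10.0  (batch of 1: full fixed cost per input)
print(per_input_ms(8))  # 3.0   (batch of 8: fixed cost amortized)
```

This is why the bin-packing items have no fixed size: the same input costs less GPU time when batched more aggressively, but only if the larger batch still meets the latency bound.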
Related
- Clipper: Selects the model to serve each request, batches requests with adaptive batch sizes, and forwards the batched requests to a backend container.
- TensorFlow Serving: Does not provide adaptive batching and caching, but has machinery for versioning models.
Large-scale short task serving problems
Distributed frontends that dispatch low-latency tasks to queues on the backend servers.
- Sparrow: focuses on dispatch strategies to reduce the delays associated with queuing in such systems.
- Slicer: provides a fast, fault-tolerant service for dividing the backend into shards and load balancing across them.
Both assume that backend server allocation and task placement are performed at a higher level, using cluster resource managers such as Mesos or Omega.
Technical details
Overview