https://devblogs.nvidia.com/how-overlap-data-transfers-cuda-cc/
https://github.com/NVIDIA-developer-blog/code-samples/blob/master/series/cuda-cpp/overlap-data-transfers/async.cu
http://on-demand.gputechconf.com/gtc/2015/presentation/S5530-Stephen-Jones.pdf
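cudaStream_t stream1;
cudaError_t result;  // declared once; each call below overwrites it
result = cudaStreamCreate(&stream1);  // create a non-default stream
// The third argument is the size in bytes; the copy is queued in stream1.
result = cudaMemcpyAsync(d_a, a, N, cudaMemcpyHostToDevice, stream1);
// The fourth launch parameter places the kernel in stream1, after the copy.
increment<<<1, N, 0, stream1>>>(d_a);
result = cudaStreamDestroy(stream1);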
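The fragment above leaves out allocation, the kernel body, and error checks. A minimal, self-contained sketch of the same pattern follows. The kernel body, the helper `ok`, the function name `run_stream_increment`, and the grid/block sizes are assumptions made for illustration and are not taken from the linked sample.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: adds 1 to each element, one thread per element.
__global__ void increment(int *d_a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d_a[i] += 1;
}

// Hypothetical helper: prints the error and returns false if a call failed.
static bool ok(cudaError_t e, const char *what) {
    if (e != cudaSuccess) {
        fprintf(stderr, "%s: %s\n", what, cudaGetErrorString(e));
        return false;
    }
    return true;
}

// Returns the number of wrong elements, or -1 if a CUDA call failed.
// Early returns skip cleanup to keep the sketch short.
int run_stream_increment(int n) {
    const size_t bytes = n * sizeof(int);
    int *a = nullptr, *d_a = nullptr;
    cudaStream_t stream1;

    // Pinned (page-locked) host memory, so the copies can be truly asynchronous.
    if (!ok(cudaMallocHost(&a, bytes), "cudaMallocHost")) return -1;
    if (!ok(cudaMalloc(&d_a, bytes), "cudaMalloc")) return -1;
    if (!ok(cudaStreamCreate(&stream1), "cudaStreamCreate")) return -1;
    for (int i = 0; i < n; ++i) a[i] = i;

    // All three operations are queued in stream1 and execute in issue order.
    cudaMemcpyAsync(d_a, a, bytes, cudaMemcpyHostToDevice, stream1);
    increment<<<(n + 255) / 256, 256, 0, stream1>>>(d_a, n);
    cudaMemcpyAsync(a, d_a, bytes, cudaMemcpyDeviceToHost, stream1);

    // Block the host until stream1 (and only stream1) has drained.
    if (!ok(cudaStreamSynchronize(stream1), "cudaStreamSynchronize")) return -1;

    int wrong = 0;
    for (int i = 0; i < n; ++i) wrong += (a[i] != i + 1);

    cudaStreamDestroy(stream1);
    cudaFree(d_a);
    cudaFreeHost(a);
    return wrong;
}
```

Calling `run_stream_increment(1000)` from `main` is expected to return 0 on a machine with a working CUDA device. A single stream only orders the operations; overlapping copies with kernels additionally needs work split across several streams.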
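cudaDeviceSynchronize(): blocks the host until all operations on the device have completed; this usually hurts performance badly.
cudaStreamSynchronize(stream): blocks the host until all operations in the given stream have completed.
cudaStreamQuery(stream): does not block; tests whether all operations in the given stream have completed.
cudaEventSynchronize(event), cudaEventQuery(event): the blocking wait and the non-blocking test for a single event.
cudaStreamWaitEvent(stream, event): makes work issued to the stream afterwards wait for the event, without blocking the host.

The deviceQuery sample reports whether the device supports concurrent copy and execution.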
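The calls listed above can be combined as in the following sketch, which orders two streams with an event and polls without blocking. It also reads the `asyncEngineCount` field of `cudaDeviceProp`, which is assumed here to be the value behind the copy-engine count that deviceQuery prints. The kernel `add_one` and the function names `async_engine_count` and `run_event_ordering` are hypothetical, and error handling is reduced to a minimum.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Number of copy engines on the device; 0 means copies cannot overlap kernels.
// Returns -1 if the properties cannot be read.
int async_engine_count(int device) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) return -1;
    return prop.asyncEngineCount;
}

// Hypothetical kernel: a single thread increments one integer.
__global__ void add_one(int *d_x) { *d_x += 1; }

// Runs add_one in stream s1, then in stream s2, ordered by an event.
// Returns the final value, or -1 if a CUDA call failed.
int run_event_ordering() {
    int *d_x = nullptr;
    int h_x = 0;
    cudaStream_t s1, s2;
    cudaEvent_t done1;

    if (cudaMalloc(&d_x, sizeof(int)) != cudaSuccess) return -1;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);
    cudaEventCreate(&done1);

    cudaMemsetAsync(d_x, 0, sizeof(int), s1);  // zero the counter in s1
    add_one<<<1, 1, 0, s1>>>(d_x);
    cudaEventRecord(done1, s1);         // marks the point after the first kernel
    cudaStreamWaitEvent(s2, done1, 0);  // s2 waits on the device; the host continues
    add_one<<<1, 1, 0, s2>>>(d_x);

    // Non-blocking poll: cudaErrorNotReady means s2 still has pending work.
    while (cudaStreamQuery(s2) == cudaErrorNotReady) {
        // The host is free to do unrelated work here.
    }

    // Blocking wait on s2 only, then a synchronous copy back to the host.
    cudaError_t e = cudaStreamSynchronize(s2);
    if (e == cudaSuccess)
        e = cudaMemcpy(&h_x, d_x, sizeof(int), cudaMemcpyDeviceToHost);

    cudaEventDestroy(done1);
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(d_x);
    return e == cudaSuccess ? h_x : -1;
}
```

Because the second kernel cannot start before `done1` has completed, `run_event_ordering()` is expected to return 2 on a working device. Using `cudaStreamWaitEvent` keeps the dependency on the device side, whereas `cudaDeviceSynchronize` would stall the host and every stream.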

The following conditions must all hold for overlapping to take place: