Streams

Non-default streams

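cudaStream_t stream1;
// Declare the status variable once and reuse it; check it against cudaSuccess after each call.
cudaError_t result = cudaStreamCreate(&stream1);
// The size argument is a byte count (use N * sizeof(element type) if N counts elements).
result = cudaMemcpyAsync(d_a, a, N, cudaMemcpyHostToDevice, stream1);
// The fourth launch parameter queues the kernel in stream1, after the copy above.
increment<<<1, N, 0, stream1>>>(d_a);
result = cudaStreamDestroy(stream1);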

Synchronization with streams
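
A minimal sketch of the usual ways to synchronize with streams, under stated assumptions: cudaStreamSynchronize blocks the host until one stream has finished, cudaStreamQuery polls a stream without blocking, cudaDeviceSynchronize waits for every stream, and cudaEventRecord plus cudaStreamWaitEvent make one stream wait for a point in another. The kernel body of increment, the helper name sync_demo, the CHECK macro, and the buffer size are assumptions made for illustration and are not taken from these notes.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: adds 1 to each of the n elements.
__global__ void increment(int *d_a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d_a[i] += 1;
}

// Leave the demo with status 1 on the first CUDA error (no cleanup, for brevity).
#define CHECK(call) do { if ((call) != cudaSuccess) return 1; } while (0)

// Hypothetical helper: returns 0 if every call succeeded and the data is correct.
int sync_demo() {
    const int n = 256;
    const size_t bytes = n * sizeof(int);
    int *h_a = nullptr, *d_a = nullptr;
    cudaStream_t s1, s2;
    cudaEvent_t copied;

    CHECK(cudaMallocHost(&h_a, bytes));  // pinned host buffer
    CHECK(cudaMalloc(&d_a, bytes));
    CHECK(cudaStreamCreate(&s1));
    CHECK(cudaStreamCreate(&s2));
    CHECK(cudaEventCreate(&copied));
    for (int i = 0; i < n; ++i) h_a[i] = i;

    // Copy in s1, then record an event that marks the end of that copy.
    CHECK(cudaMemcpyAsync(d_a, h_a, bytes, cudaMemcpyHostToDevice, s1));
    CHECK(cudaEventRecord(copied, s1));

    // Stream-to-stream: work queued in s2 after this call waits for the event.
    CHECK(cudaStreamWaitEvent(s2, copied, 0));
    increment<<<1, n, 0, s2>>>(d_a, n);
    CHECK(cudaGetLastError());
    CHECK(cudaMemcpyAsync(h_a, d_a, bytes, cudaMemcpyDeviceToHost, s2));

    // Non-blocking poll: cudaSuccess if s2 is idle, cudaErrorNotReady if still busy.
    cudaError_t state = cudaStreamQuery(s2);
    if (state != cudaSuccess && state != cudaErrorNotReady) return 1;

    CHECK(cudaStreamSynchronize(s2));  // block the host until s2 has drained
    CHECK(cudaDeviceSynchronize());    // block the host until all streams have drained

    int status = 0;
    for (int i = 0; i < n; ++i)
        if (h_a[i] != i + 1) status = 1;

    CHECK(cudaEventDestroy(copied));
    CHECK(cudaStreamDestroy(s1));
    CHECK(cudaStreamDestroy(s2));
    CHECK(cudaFree(d_a));
    CHECK(cudaFreeHost(h_a));
    return status;
}
```

Calling sync_demo() from main and checking for a return value of 0 exercises all four mechanisms. In real code the stream-level calls are usually preferable to cudaDeviceSynchronize, because they stall the host only for the work it actually depends on.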

Overlapping Kernel Execution and Data Transfers

Running deviceQuery shows whether the device supports concurrent copy and execution.
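
The same capability can also be checked from code. The sketch below reads the cudaDevAttrAsyncEngineCount attribute through cudaDeviceGetAttribute; a value of 0 means copies cannot overlap with kernels, and a positive value is the number of asynchronous copy engines. The helper names async_engine_count and report_overlap_support are hypothetical.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: number of asynchronous copy engines on the device,
// or -1 if the query fails (for example, when the device index is invalid).
int async_engine_count(int device) {
    int engines = 0;
    if (cudaDeviceGetAttribute(&engines, cudaDevAttrAsyncEngineCount, device)
            != cudaSuccess)
        return -1;
    return engines;
}

// Hypothetical helper: prints a one-line summary for the given device.
void report_overlap_support(int device) {
    int engines = async_engine_count(device);
    if (engines < 0) {
        printf("device %d: query failed\n", device);
    } else {
        printf("device %d: %d async copy engine(s), copy/kernel overlap %s\n",
               device, engines, engines > 0 ? "supported" : "not supported");
    }
}
```

Calling report_overlap_support(0) prints the result for the first device.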

All of the following conditions must hold for the overlap to happen:

  1. The device supports "concurrent copy and execution".
  2. The kernel execution and the data transfer are issued to different, non-default streams.
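
A minimal sketch of how these conditions look in code: the array is split into chunks, and each chunk's copy-in, kernel, and copy-out are issued to its own non-default stream, so one chunk's transfer can overlap with another chunk's kernel. The sketch also allocates the host buffer with cudaMallocHost, since asynchronous copies need page-locked (pinned) host memory in order to overlap with kernels; that goes beyond the two conditions listed. The kernel body, the helper name overlap_demo, the CHECK macro, the stream count, and the chunk size are all assumptions made for illustration.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: adds 1 to each of the n elements.
__global__ void increment(int *d_a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d_a[i] += 1;
}

// Leave the demo with status 1 on the first CUDA error (no cleanup, for brevity).
#define CHECK(call) do { if ((call) != cudaSuccess) return 1; } while (0)

// Hypothetical helper: returns 0 if every call succeeded and the data is correct.
int overlap_demo() {
    const int nStreams = 4;
    const int chunk = 1 << 18;            // elements per stream
    const int n = nStreams * chunk;
    const size_t chunkBytes = chunk * sizeof(int);
    const int threads = 256;
    const int blocks = (chunk + threads - 1) / threads;

    int *h_a = nullptr, *d_a = nullptr;
    CHECK(cudaMallocHost(&h_a, n * sizeof(int)));  // pinned host memory
    CHECK(cudaMalloc(&d_a, n * sizeof(int)));
    for (int i = 0; i < n; ++i) h_a[i] = i;

    cudaStream_t streams[nStreams];
    for (int s = 0; s < nStreams; ++s) CHECK(cudaStreamCreate(&streams[s]));

    // Each stream handles one chunk: copy in, run the kernel, copy out.
    // Operations within a stream run in order; different streams may overlap.
    for (int s = 0; s < nStreams; ++s) {
        int offset = s * chunk;
        CHECK(cudaMemcpyAsync(d_a + offset, h_a + offset, chunkBytes,
                              cudaMemcpyHostToDevice, streams[s]));
        increment<<<blocks, threads, 0, streams[s]>>>(d_a + offset, chunk);
        CHECK(cudaGetLastError());
        CHECK(cudaMemcpyAsync(h_a + offset, d_a + offset, chunkBytes,
                              cudaMemcpyDeviceToHost, streams[s]));
    }
    CHECK(cudaDeviceSynchronize());  // wait for all streams before reading h_a

    int status = 0;
    for (int i = 0; i < n; ++i)
        if (h_a[i] != i + 1) status = 1;

    for (int s = 0; s < nStreams; ++s) CHECK(cudaStreamDestroy(streams[s]));
    CHECK(cudaFree(d_a));
    CHECK(cudaFreeHost(h_a));
    return status;
}
```

The code only makes overlap possible; whether it actually happens depends on the device's copy engines and on the order in which the work is issued. A profiler timeline is the reliable way to confirm it.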