Best Approach For Tiny Repeated Pipelined CUDA Kernel

Are there any better ways of performing simple scalar operations on device other than repeatedly launching tiny kernels? I am trying to fully pipeline a set of vector routines which use a combination of cuBLAS when available and hand rolled kernels when not. For example suppose you want to normalize a vector (assume each function call is fully asynchronous w.r.t. to the host):

__device__ double normDevice;
__host__   double normHost;

/* Contains cublas<t>nrm2() */
VecComputeNormAsync(vec, &normDevice, stream);
cudaMemcpyAsync(&normHost, &normDevice, sizeof(double), cudaMemcpyDeviceToHost, stream);
/* fence for memcpy */
normHost = 1.0/normHost;
cudaMemcpyAsync(&normDevice, &normHost, sizeof(double), cudaMemcpyHostToDevice, stream);
/* Now scaling by 1.0/norm, contains cublas<t>scal() */
VecScaleAsync(vec, normDevice, stream);

Not only does this hard-stop the pipeline, the memcpy up and down for single values feels so wasteful. Currently my best idea is the following micro kernel, but it seems wildly inefficient (not to mention I would have to provide such a function for every kind of operation):

__global__ void divDevice(double *numerator, double *denominator, double *ret)
  *ret = (*numerator)/(*denominator);

errorCode divAsync(double *numerator, bool isNumHost, double *denominator, bool isDenHost, double *retval, cudaStream_t stream) 
  /* convert either or both numerator or denominator to device */
  divDevice<<<1, 1, 0, stream>>>(numeratorDevice, denominatorDevice, retvalDevice);

Perhaps there is a clever way of having a constantly running kernel in the background, signalling to it which op you'd like it to perform and then placing the operands in memory for it?

Read more here:

Content Attribution

This content was originally published by Jacob Faib at Recent Questions - Stack Overflow, and is syndicated here via their RSS feed. You can read the original post over there.

%d bloggers like this: