I've been writing some basic OpenCL programs and running them on a single device, timing the code to get an idea of how well each program performs.
I have been looking at getting my kernels to run on the platform's GPU device and CPU device at the same time. The cl::Context constructor can be passed a std::vector of devices to initialise a context with multiple devices. My system has a single GPU and a single CPU.
Is constructing a context with a vector of the available devices all that's required for kernels to be distributed across multiple devices? I noticed a significant performance increase when I constructed the context with both devices, but it seems too simple.