Welcome
» NERWous C
» Examples
Current Literature
CUDA is the parallel computing platform and programming model from NVIDIA. It allows programs to run on the main Central Processing Unit (CPU) and offload heavily mathematical computation tasks to NVIDIA Graphical Processing Units (GPUs). A computer host can have one or more CPUs and one or more GPUs. All these computer resources can be harnessed by the CUDA parallel programming model.
Let's take a look at this CUDA example that does addition of two arrays of 1 million elements:
The
The above CUDA example uses a technique called grid-stride loop. Instead of running the
NERWous C Sample
This is the NERWous C sample correspondent:
The additions are done by the function
The
The matrixes
Previous Next Top
Current Literature
CUDA is the parallel computing platform and programming model from NVIDIA. It allows programs to run on the main Central Processing Unit (CPU) and offload heavily mathematical computation tasks to NVIDIA Graphical Processing Units (GPUs). A computer host can have one or more CPUs and one or more GPUs. All these computer resources can be harnessed by the CUDA parallel programming model.
Let's take a look at this CUDA example that does addition of two arrays of 1 million elements:
#include <iostream>
#include <math.h>
#define NUMBLOCKS 4
#define NUMTHREADSPERBLOCK 256
// Kernel function to add the elements of two arrays
__global__
void add(int n, float *x, float *y)
{
for (int i = blockIdx.x * blockDim.x + threadIdx.x;
i < n;
i += blockDim.x * gridDim.x)
y[i] = x[i] + y[i];
}
int main(void)
{
int N = 1<<20;
float *x, *y;
// Allocate Unified Memory – accessible from CPU or GPU
cudaMallocManaged(&x, N*sizeof(float));
cudaMallocManaged(&y, N*sizeof(float));
// initialize x and y arrays on the host
for (int i = 0; i < N; i++) {
x[i] = 1.0f;
y[i] = 2.0f;
}
// Run kernel on 1M elements on the GPU
// on 4 thread blocks with 256 threads per block for a total of 1024 threads
add<<<NUMBLOCKS, NUMTHREADSPERBLOCK>>>(N, x, y);
// Wait for GPU to finish before accessing on host
cudaDeviceSynchronize();
// Check for errors (all values should be 3.0f)
float maxError = 0.0f;
for (int i = 0; i < N; i++)
maxError = fmax(maxError, fabs(y[i]-3.0f));
std::cout << "Max error: " << maxError << std::endl;
// Free memory
cudaFree(x);
cudaFree(y);
return 0;
}
The specifier __global__
to the function ''add'' tells the CUDA C++ compiler that "this is a function that runs on the GPU and can be called from CPU code".
The
cudaMallocManaged
CUDA function allocates memory for the shared arrays from the Unified Memory space. This memory space is accessible to both the main CPU and the GPU.
The above CUDA example uses a technique called grid-stride loop. Instead of running the
add
function for 1M items in 1M virtual threads, it runs based on the capability of the hardware on hand. While programming with the unlimited threads assumption results in pretty concise code, in practice, the cost of setting up and tearing down actual processing units to map to virtual threads imposes a performance penalty on the running program. The grid-stride loop strategy identifies the number of available executing units, and then divides the work to these units. In the above CUDA example, NUMBLOCKS * NUMTHREADSPERBLOCK
units (or threads in CUDA terminology) divvy up the 1 M addition tasks, with each thread handle 1 M / NUMBLOCKS * NUMTHREADSPERBLOCK
additions.
NERWous C Sample
This is the NERWous C sample correspondent:
#define NUMBLOCKS 4
#define NUMTHREADSPERBLOCK 256
#define NUMTHREADS NUMBLOCKS*NUMTHREADSPERBLOCK
void add(int n, mel float *x, mel float *y)
{
<cel> mythread = pel<location>;
int index = atoi(mythread<tag>);
int stride = NUMTHREADS;
for (int i = index; i < n; i += stride)
y[i] = x[i] + y[i];
}
main () {
int N = 1<<20; /* 1 M items */
// Allocate Unified Memory – accessible from CPU or GPU
<mel at=UnifiedMemory> float x[N];
<mel at=UnifiedMemory> float y[N];
// Initialize x and y arrays on the host
<?> (x, y) {
for (int i = 0; i < N; i++) {
x[i] = 1.0f;
y[i] = 2.0f;
}
}
// Run kernel on 1M elements on the GPU
<cel> threads[NUMTHREADS];
<collect> for (int index = 0; index < NUMTHREADS; ++index)
<! at=threads[index]> {
add (N, x, y, index, numthreads);
} <? ENDED>; // Wait for GPU to finish before accessing on host
// Check for errors (all values should be 3.0f)
float maxError = 0.0f;
<?>(y) {
for (int i = 0; i < N; i++)
maxError = fmax(maxError, fabs(y[i]-3.0f));
}
printf ( "Max error: %d", maxError);
// Free memory
<close> x;
<close> y;
return 0;
}
The CUDA example identifies NUMBLOCKS*NUMTHREADSPERBLOCK
threads. These are defined in NERWous C as cel array variable threads
:
<cel> threads[NUMTHREADS];
The for (int index = 0; index < NUMTHREADS; ++index)
loop divvies the 1-Million N
items into the available threads
cels. By wrapping the for loop with the collect wrapper, the main
task waits for all additions to be finished before it checks for errors. This is equivalent to the CUDA library function cudaDeviceSynchronize
.
The additions are done by the function
add
. Instead of doing an addition for one matrix element, it runs a for
loop to handle N/NUMTHREADS
items. These items are not consecutive. For example, the add
task running on the cel threads[0]
will handle item 0 then item NUMTHREADS then item NUMTHREADS*2, then item NUMTHREADS*3, and so on.
The
add
task uses, as its starting index into the matrix array, the index of the cel threads
it is running on. It finds this information by first locating the hosting cel via the pel<location>
, and then finding the tag of the cel via mythread<tag>
. The tag value is a string and must be converted to an integer to be used as a numerical index. For an array of cels, such as <cel> threads[NUMTHREADS]
, the default tag value goes from the string "0" to the string for NUMTHREADS-1. In other environments, the tag value may be a non-numerical string to reflect the underlying cataloging scheme.
The matrixes
x
and y
are shared entities accessible by all add
tasks. In the CUDA environment, they are hosted in the Unified Memory. This space is represented by the cel UnifiedMemory
used in the declaration of the mel variables x
and y
. A NERWous C translator for the CUDA environment would translate those definitions into the CUDA cudaMallocManaged
and other related library functions.
Previous Next Top