Current Literature
CUDA is the parallel computing platform and programming model from NVIDIA. It allows programs to run on the main Central Processing Unit (CPU) and offload math-heavy computation tasks to NVIDIA Graphics Processing Units (GPUs). A computer host can have one or more CPUs and one or more GPUs. All of these compute resources can be harnessed by the CUDA parallel programming model.
Let's take a look at this CUDA example that does addition of two arrays of 1 million elements:
#include <iostream>
#include <math.h>

#define NUMBLOCKS 4
#define NUMTHREADSPERBLOCK 256

// Kernel function to add the elements of two arrays
__global__
void add(int n, float *x, float *y)
{
  for (int i = blockIdx.x * blockDim.x + threadIdx.x;
       i < n;
       i += blockDim.x * gridDim.x)
    y[i] = x[i] + y[i];
}

int main(void)
{
  int N = 1<<20;
  float *x, *y;

  // Allocate Unified Memory – accessible from CPU or GPU
  cudaMallocManaged(&x, N*sizeof(float));
  cudaMallocManaged(&y, N*sizeof(float));

  // initialize x and y arrays on the host
  for (int i = 0; i < N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }

  // Run kernel on 1M elements on the GPU
  // on 4 thread blocks with 256 threads per block for a total of 1024 threads
  add<<<NUMBLOCKS, NUMTHREADSPERBLOCK>>>(N, x, y);

  // Wait for GPU to finish before accessing on host
  cudaDeviceSynchronize();

  // Check for errors (all values should be 3.0f)
  float maxError = 0.0f;
  for (int i = 0; i < N; i++)
    maxError = fmax(maxError, fabs(y[i]-3.0f));
  std::cout << "Max error: " << maxError << std::endl;

  // Free memory
  cudaFree(x);
  cudaFree(y);

  return 0;
}
The specifier __global__ on the function add tells the CUDA C++ compiler that "this is a function that runs on the GPU and can be called from CPU code".
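As a minimal sketch (the kernel name and arguments below are made up for illustration), a __global__ function returns void and is launched from host code with the <<<blocks, threads>>> execution configuration:
// Hypothetical kernel, for illustration only
__global__ void scale(int n, float *data, float factor)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    data[i] *= factor;
}
// Host-side call:
// scale<<<NUMBLOCKS, NUMTHREADSPERBLOCK>>>(n, data, 2.0f);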
The cudaMallocManaged CUDA function allocates memory for the shared arrays from the Unified Memory space. This memory space is accessible to both the main CPU and the GPU.
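Like most CUDA runtime calls, cudaMallocManaged returns a status code; a more defensive version of the allocation in the example could check it (the error handling below is an added sketch, not part of the original listing):
cudaError_t err = cudaMallocManaged(&x, N*sizeof(float));
if (err != cudaSuccess) {
  std::cerr << "cudaMallocManaged failed: " << cudaGetErrorString(err) << std::endl;
  return 1;
}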
The above CUDA example uses a technique called grid-stride loop. Instead of running the add function for 1M items in 1M virtual threads, it runs based on the capability of the hardware on hand. While programming with the unlimited-threads assumption results in pretty concise code, in practice the cost of setting up and tearing down actual processing units to map to virtual threads imposes a performance penalty on the running program. The grid-stride loop strategy identifies the number of available executing units, and then divides the work among these units. In the above CUDA example, NUMBLOCKS * NUMTHREADSPERBLOCK units (or threads in CUDA terminology) divvy up the 1M addition tasks, with each thread handling N / (NUMBLOCKS * NUMTHREADSPERBLOCK), or 1024, additions.
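For contrast, a monolithic version of the kernel (a sketch, not part of the original example) would dedicate one thread per element and size the grid to cover all N items:
__global__
void add_monolithic(int n, float *x, float *y)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)                 // one element per thread, no loop
    y[i] = x[i] + y[i];
}
// Host side: enough blocks to cover every element
// int numBlocks = (N + NUMTHREADSPERBLOCK - 1) / NUMTHREADSPERBLOCK;
// add_monolithic<<<numBlocks, NUMTHREADSPERBLOCK>>>(N, x, y);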
NERWous C Sample
This is the corresponding NERWous C sample:
#define NUMBLOCKS 4
#define NUMTHREADSPERBLOCK 256
#define NUMTHREADS NUMBLOCKS*NUMTHREADSPERBLOCK

void add(int n, mel float *x, mel float *y)
{
  <cel> mythread = pel<location>;
  int index = atoi(mythread<tag>);
  int stride = NUMTHREADS;
  for (int i = index; i < n; i += stride)
    y[i] = x[i] + y[i];
}

main () {
  int N = 1<<20; /* 1 M items */

  // Allocate Unified Memory – accessible from CPU or GPU
  <mel at=UnifiedMemory> float x[N];
  <mel at=UnifiedMemory> float y[N];

  // Initialize x and y arrays on the host
  <?> (x, y) {
    for (int i = 0; i < N; i++) {
      x[i] = 1.0f;
      y[i] = 2.0f;
    }
  }

  // Run kernel on 1M elements on the GPU
  <cel> threads[NUMTHREADS];
  <collect> for (int index = 0; index < NUMTHREADS; ++index)
    <! at=threads[index]> {
      add (N, x, y);
    } <? ENDED>; // Wait for GPU to finish before accessing on host

  // Check for errors (all values should be 3.0f)
  float maxError = 0.0f;
  <?>(y) {
    for (int i = 0; i < N; i++)
      maxError = fmax(maxError, fabs(y[i]-3.0f));
  }
  printf ("Max error: %f\n", maxError);

  // Free memory
  <close> x;
  <close> y;

  return 0;
}
The CUDA example identifies NUMBLOCKS*NUMTHREADSPERBLOCK threads. These are defined in NERWous C as the cel array variable threads:
<cel> threads[NUMTHREADS];
The for (int index = 0; index < NUMTHREADS; ++index) loop divvies the 1-million N items among the available threads cels. By wrapping the for loop with the collect wrapper, the main task waits for all additions to be finished before it checks for errors. This is equivalent to the CUDA library function cudaDeviceSynchronize.
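On the CUDA side, this wait is the kernel launch followed by cudaDeviceSynchronize, whose return value can also be checked (the check is an added sketch, not in the original listing):
add<<<NUMBLOCKS, NUMTHREADSPERBLOCK>>>(N, x, y);
cudaError_t err = cudaDeviceSynchronize();  // block the host until every GPU thread has finished
if (err != cudaSuccess)
  std::cerr << "Kernel execution failed: " << cudaGetErrorString(err) << std::endl;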
The additions are done by the function add. Instead of doing an addition for one matrix element, it runs a for loop to handle N/NUMTHREADS items. These items are not consecutive. For example, the add task running on the cel threads[0] will handle item 0, then item NUMTHREADS, then item NUMTHREADS*2, then item NUMTHREADS*3, and so on.
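As a concrete check of the arithmetic: with N = 1<<20 and NUMTHREADS = 4 * 256 = 1024, each add task performs N/NUMTHREADS = 1024 additions, and threads[0] touches items 0, 1024, 2048, and so on, up to item 1047552.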
The add task uses, as its starting index into the matrix array, the index of the threads cel it is running on. It finds this information by first locating the hosting cel via pel<location>, and then finding the tag of that cel via mythread<tag>. The tag value is a string and must be converted to an integer to be used as a numerical index. For an array of cels, such as <cel> threads[NUMTHREADS], the default tag value goes from the string "0" to the string for NUMTHREADS-1. In other environments, the tag value may be a non-numerical string reflecting the underlying cataloging scheme.
The matrices x and y are shared entities accessible by all add tasks. In the CUDA environment, they are hosted in the Unified Memory. This space is represented by the cel UnifiedMemory used in the declaration of the mel variables x and y. A NERWous C translator for the CUDA environment would translate those definitions into the CUDA cudaMallocManaged and other related library functions.
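For instance (a sketch of what such a translation might produce, not the output of an actual translator), the mel declarations and the matching <close> statements could map to Unified Memory allocation and release:
// <mel at=UnifiedMemory> float x[N];   might become:
float *x;
cudaMallocManaged(&x, N*sizeof(float));
// ...
// <close> x;   might become:
cudaFree(x);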