Current Literature
Recent computer designs have a general-purpose central processing unit (CPU) and a dedicated graphics processing unit (GPU). Due to its need to render graphics, a GPU crunches numbers at speeds several orders of magnitude beyond a CPU's. A program can therefore offload the heavy mathematical computation to the GPU and retain the logical processing on the CPU.
As an example of this workload division, we will do a matrix addition:
Two matrices must have an equal number of rows and columns to be added. The sum of two matrices A and B will be a matrix which has the same number of rows and columns as do A and B. The sum of A and B, denoted A + B, is computed by adding corresponding elements of A and B.
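(Before the parallel versions, here is a plain serial C++ sketch of this element-wise addition, just to fix the idea. The add_matrices helper and the 2 x 2 values are only for illustration; they are not part of the samples below.)
#include <vector>
#include <cstdio>

// Serial element-wise addition: C[i][j] = A[i][j] + B[i][j],
// with each matrix stored row-major in a flat vector of size M * N.
void add_matrices(const std::vector<int>& A, const std::vector<int>& B,
                  std::vector<int>& C, int M, int N)
{
    for (int i = 0; i < M; i++)
        for (int j = 0; j < N; j++)
            C[i * N + j] = A[i * N + j] + B[i * N + j];
}

int main()
{
    // 2 x 2 example: A + B = {{6, 8}, {10, 12}}
    std::vector<int> A = {1, 2, 3, 4};
    std::vector<int> B = {5, 6, 7, 8};
    std::vector<int> C(4);
    add_matrices(A, B, C, 2, 2);
    std::printf("%d %d %d %d\n", C[0], C[1], C[2], C[3]);
}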
An example of offloading matrix addition to a GPU is described in C++ - A Code-Based Introduction to C++ AMP. C++ Accelerated Massive Parallelism (C++ AMP) is Microsoft's solution for writing a program such that portions of it can be compiled and executed on data-parallel hardware. This is the C++ AMP code for matrix addition:
#include <amp.h>         // C++ AMP header file
#include <algorithm>     // for std::generate
using namespace concurrency;   // Save some typing :)
using std::vector;             // Ditto. Brought in by amp.h

void perform_calculation(
    vector<int>& vA, vector<int>& vB, vector<int>& vC, int M, int N);

int main()
{
    // Rows and columns for matrix
    const int M = 1024;
    const int N = 1024;

    // Create storage for a matrix of above size
    vector<int> vA(M * N);
    vector<int> vB(M * N);

    // Populate matrix objects
    int i = 0;
    std::generate(vA.begin(), vA.end(), [&i](){return i++;});
    std::generate(vB.begin(), vB.end(), [&i](){return i--;});

    // Output storage for matrix calculation
    vector<int> vC(M * N);

    perform_calculation(vA, vB, vC, M, N);
}

void perform_calculation(
    vector<int>& vA, vector<int>& vB, vector<int>& vC, int M, int N)
{
    extent<2> e(M, N);
    array_view<const int, 2> a(e, vA), b(e, vB);
    array_view<int, 2> c(e, vC);

    parallel_for_each(e, [=](index<2> idx) restrict(amp)
    {
        c[idx] = a[idx] + b[idx];
    });
}
The description of the C++ AMP specific constructs -- extent, array_view, index and parallel_for_each -- can be read from the reference. Here we are interested in the restrict construct. C++ AMP supports restrict(cpu) for code to be run on the main CPU; this is the default. Code to be offloaded to a hardware accelerator is annotated with restrict(amp). At compilation time, the annotated code is checked against the instruction set that such an accelerator can support.
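As a small illustration of how the restrict specifiers appear on ordinary functions (the function names here are made up for this sketch, and compiling it needs a C++ AMP capable compiler such as Visual C++):
#include <amp.h>
using namespace concurrency;

int add_cpu(int x, int y) restrict(cpu)      // runs on the host CPU (the default)
{
    return x + y;
}

int add_amp(int x, int y) restrict(amp)      // may only be called from amp-restricted code
{
    return x + y;
}

int add_any(int x, int y) restrict(cpu, amp) // compiled for both targets
{
    return x + y;
}
Inside the parallel_for_each lambda above, only functions restricted with amp (or with cpu, amp) may be called.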
NERWous C Sample
Let's write the matrix addition in NERWous C. The additions will be run in parallel on a cel location called GPU. Since the input matrices, vA and vB, are to be accessed by multiple tasks, they are declared as mel shared variables. Since, for matrix addition, the corresponding elements of the input matrices are added together, vA and vB are declared as mel arrays to facilitate individual access.
extern <cel> GPU;
#define M 1024
#define N 1024

int main()
{
    // Create matrix objects in shared storage
    <mel> int vA[M][N];    // input
    <mel> int vB[M][N];    // input
    <mel> int vC[M][N];    // output

    // Populate matrix objects serially
    int acount = 0;
    int bcount = M*N;
    for (int i = 0; i < M; i++)
    {
        for (int j = 0; j < N; j++)
        {
            <?>vA[i][j] = acount++;
            <?>vB[i][j] = bcount--;
        }
    }

    perform_calculation(vA, vB, vC, M, N);
}

void perform_calculation(
    <mel> int vA[][], <mel> int vB[][], <mel> int vC[][], int M, int N)
{
    // Compute additions in parallel on the GPU
    for (int i = 0; i < M; i++)
    <collect> {
        for (int j = 0; j < N; j++)
        <! at=GPU> {
            <?>vC[i][j] = <?>vA[i][j] + <?>vB[i][j];
        }
    } <? ENDED>;    /* make sure that all tasks have ended */
}
The population of the matrix objects is done serially to guarantee that the local variables acount and bcount are respectively incremented and decremented sequentially.
In the perform_calculation function, we use two for loops to iterate through all the elements of the matrix objects. The first for loop runs on the main CPU using whatever thread facility is supported; its only purpose is to invoke the inner loops. The inner for loops have all their tasks running on the GPU cel since they are doing the "computationally heavy" arithmetic operations. (The addition operation here is of course not "computationally heavy", but you get the idea.)
The perform_calculation function embeds the for loops inside a collect-ENDED wrapper so that it can wait for all the tasks it pels to have ended before it does a function return.
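For readers more comfortable with standard C++, the wait-for-all behavior of the collect-ENDED wrapper is roughly analogous to launching each row as an asynchronous task and joining every task before returning. The sketch below is plain C++11 with std::async -- it is not NERWous C, and it runs on CPU threads instead of a GPU cel -- but it shows the same fork-then-wait shape:
#include <future>
#include <vector>

// Rough CPU-only analogue of perform_calculation: one task per row,
// and the function only returns after every task has ended.
void perform_calculation(const std::vector<int>& vA, const std::vector<int>& vB,
                         std::vector<int>& vC, int M, int N)
{
    std::vector<std::future<void>> tasks;
    for (int i = 0; i < M; i++)
    {
        tasks.push_back(std::async(std::launch::async, [&, i]()
        {
            for (int j = 0; j < N; j++)
                vC[i * N + j] = vA[i * N + j] + vB[i * N + j];
        }));
    }
    for (auto& t : tasks)    // analogous to the <? ENDED> wait
        t.wait();
}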
How is the cel GPU defined? It is defined as extern at the beginning of the NERWous program. This means that its definition comes from the configuration file associated with the NERWous program. In a truly "Accelerated Massive Parallelism" environment, this definition will point to the installed GPU, and the NERWous program will be compiled with a corresponding AMP library and a pre-processor that verifies that all the NERWous C code assigned to the cel GPU can be executed on the real GPU.
The programmer can associate a different configuration with the NERWous C program, such as one where the cel GPU is a web site with SOAP access. A different library and pre-processor will be used in that case, but the NERWous C program will stay the same.