Current Literature
Recent computer designs have a general-purpose central processing unit (CPU) and a dedicated graphics processing unit (GPU). Due to its need to render graphics, a GPU crunches numbers at speeds several orders of magnitude beyond a CPU's. A program can therefore offload the heavy mathematical computation to the GPU and retain the logical processing on the CPU.
As an example of this workload division, we will do a matrix addition:
Two matrices must have an equal number of rows and columns to be added. The sum of two matrices A and B will be a matrix which has the same number of rows and columns as do A and B. The sum of A and B, denoted A + B, is computed by adding corresponding elements of A and B.
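(Before the parallel versions, here is a plain serial C++ sketch of this element-wise addition, just to fix the idea. The add_matrices helper and the 2 x 2 values are only for illustration; they are not part of the samples below.)
#include <vector>
#include <cstdio>

// Serial element-wise addition: C[i][j] = A[i][j] + B[i][j],
// with each matrix stored row-major in a flat vector of size M * N.
void add_matrices(const std::vector<int>& A, const std::vector<int>& B,
                  std::vector<int>& C, int M, int N)
{
    for (int i = 0; i < M; i++)
        for (int j = 0; j < N; j++)
            C[i * N + j] = A[i * N + j] + B[i * N + j];
}

int main()
{
    // 2 x 2 example: A + B = {{6, 8}, {10, 12}}
    std::vector<int> A = {1, 2, 3, 4};
    std::vector<int> B = {5, 6, 7, 8};
    std::vector<int> C(4);
    add_matrices(A, B, C, 2, 2);
    std::printf("%d %d %d %d\n", C[0], C[1], C[2], C[3]);
}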
An example of offloading matrix addition to a GPU is described in C++ - A Code-Based Introduction to C++ AMP. C++ Accelerated Massive Parallelism (C++ AMP) is Microsoft's solution for writing a program such that portions of it can be compiled and executed on data-parallel hardware. This is the C++ AMP code for matrix addition:
#include <amp.h>         // C++ AMP header file
#include <algorithm>     // for std::generate
using namespace concurrency;   // Save some typing :)
using std::vector;             // Ditto. Brought in by amp.h

void perform_calculation(
    vector<int>& vA, vector<int>& vB, vector<int>& vC, int M, int N);

int main()
{
    // Rows and columns for matrix
    const int M = 1024;
    const int N = 1024;

    // Create storage for a matrix of above size
    vector<int> vA(M * N);
    vector<int> vB(M * N);

    // Populate matrix objects
    int i = 0;
    std::generate(vA.begin(), vA.end(), [&i](){return i++;});
    std::generate(vB.begin(), vB.end(), [&i](){return i--;});

    // Output storage for matrix calculation
    vector<int> vC(M * N);

    perform_calculation(vA, vB, vC, M, N);
}

void perform_calculation(
    vector<int>& vA, vector<int>& vB, vector<int>& vC, int M, int N)
{
    extent<2> e(M, N);
    array_view<const int, 2> a(e, vA), b(e, vB);
    array_view<int, 2> c(e, vC);

    parallel_for_each(e, [=](index<2> idx) restrict(amp)
    {
        c[idx] = a[idx] + b[idx];
    });
}
The description of the C++ AMP specific constructs -- extent, array_view, index and parallel_for_each -- can be read from the reference. Here we are interested in the restrict construct. C++ AMP supports restrict(cpu) for code to be run on the main CPU; this is the default. Code to be offloaded to a hardware accelerator is annotated with restrict(amp). At compilation time, the annotated code is checked against the instruction set that such an accelerator can support.
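As a small illustration of how the restrict specifiers appear on ordinary functions (the function names here are made up for this sketch, and compiling it needs a C++ AMP capable compiler such as Visual C++):
#include <amp.h>
using namespace concurrency;

int add_cpu(int x, int y) restrict(cpu)      // runs on the host CPU (the default)
{
    return x + y;
}

int add_amp(int x, int y) restrict(amp)      // may only be called from amp-restricted code
{
    return x + y;
}

int add_any(int x, int y) restrict(cpu, amp) // compiled for both targets
{
    return x + y;
}
Inside the parallel_for_each lambda above, only functions restricted with amp (or with cpu, amp) may be called.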
NERWous C Sample
Let's write the matrix addition in NERWous C. The additions will be run in parallel on a cel location called GPU. Since the input matrices, vA and vB, are to be accessed by multiple tasks, they are declared as mel shared variables. Since, for matrix addition, the corresponding elements of the input matrices are added together, vA and vB are declared as mel arrays to facilitate individual access.
extern <cel> GPU;
#define M 1024
#define N 1024

int main()
{
    // Create matrix objects in shared storage
    <mel> int vA[M][N];    // input
    <mel> int vB[M][N];    // input
    <mel> int vC[M][N];    // output

    // Populate matrix objects serially
    int acount = 0;
    int bcount = M*N;
    for (int i = 0; i < M; i++)
    {
        for (int j = 0; j < N; j++)
        {
            <?>vA[i][j] = acount++;
            <?>vB[i][j] = bcount--;
        }
    }

    perform_calculation(vA, vB, vC, M, N);
}

void perform_calculation(
    <mel> int vA[][], <mel> int vB[][], <mel> int vC[][], int M, int N)
{
    // Compute additions in parallel on the GPU
    for (int i = 0; i < M; i++)
    <collect> {
        for (int j = 0; j < N; j++)
        <! at=GPU> {
            <?>vC[i][j] = <?>vA[i][j] + <?>vB[i][j];
        }
    } <? ENDED>;    /* make sure that all tasks have ended */
}
The population of the matrix objects is done serially to guarantee that the local variables acount and bcount are respectively incremented and decremented sequentially.
In the perform_calculation function, we use two for loops to iterate through all the elements of the matrix objects. The first for loop runs on the main CPU using whatever thread facility is supported; its only purpose is to invoke the inner loops. The inner for loops have all their tasks running on the GPU cel since they are doing the "computationally heavy" arithmetic operations. (The addition operation here is of course not "computationally heavy", but you get the idea.)
The perform_calculation function embeds the for loops inside a collect-ENDED wrapper so that it can wait for all the tasks it pels to have ended before it does a function return.
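For readers more comfortable with standard C++, the wait-for-all behavior of the collect-ENDED wrapper is roughly analogous to launching each row as an asynchronous task and joining every task before returning. The sketch below is plain C++11 with std::async -- it is not NERWous C, and it runs on CPU threads instead of a GPU cel -- but it shows the same fork-then-wait shape:
#include <future>
#include <vector>

// Rough CPU-only analogue of perform_calculation: one task per row,
// and the function only returns after every task has ended.
void perform_calculation(const std::vector<int>& vA, const std::vector<int>& vB,
                         std::vector<int>& vC, int M, int N)
{
    std::vector<std::future<void>> tasks;
    for (int i = 0; i < M; i++)
    {
        tasks.push_back(std::async(std::launch::async, [&, i]()
        {
            for (int j = 0; j < N; j++)
                vC[i * N + j] = vA[i * N + j] + vB[i * N + j];
        }));
    }
    for (auto& t : tasks)    // analogous to the <? ENDED> wait
        t.wait();
}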
How is the cel GPU defined? It is defined as extern at the beginning of the NERWous program. This means that its definition comes from the configuration file associated with the NERWous program. In a truly "Accelerated Massive Parallelism" environment, this definition will point to the installed GPU, and the NERWous program will be compiled with a corresponding AMP library and a pre-processor that verifies that all the NERWous C code assigned to the cel GPU can be executed on the real GPU.
The programmer can associate a different configuration with the NERWous C program, such as one where the cel GPU is a web site with SOAP access. A different library and pre-processor will be used in that case, but the NERWous C program will stay the same.