
Tuesday, January 24, 2017

Pel Basic

  1. Pel Construct
  2. Pel Code Blocks
  3. Pel Life Cycle
  4. Pel Variables
  5. Synchronous Task Creation
  6. Pel Operations
  7. Pel Properties
  8. Pel Exceptions


Pel Construct

Let's revisit the Producer/Consumer example we first encountered in the introduction to the NERWous C mel construct. First, let's review the serial program written in standard C, with a while loop of Produce and Consume that runs until there are no more products (Produce returns 0):
main () {
     int c;
     while (c = Produce())
           Consume(c);
}
This is the NERWous C version as we have seen before:
/* VERSION 1 */
main () {
    <mel> int store;
    <!> Producer (store);
    <!> Consumer (store);
}
void Producer (<mel> int store) {
    while ( <?>store = Produce() );
}
void Consumer (<mel> int store) {
    while ( int c = <?>store )
        Consume(c);
}
The main task creates two parallel tasks, Producer and Consumer, and the mel variable store as the communication channel between the two. The task creation is done via the pel construct, <!>.

The pel construct is both a compile-time and run-time artifact. During compilation, the NERWous C translator will replace the pel construct with code native to the targeted platform. When the compiled program is run, the CHAOS runtime environment will execute this code to have the task created as a thread, job, remote process, web service or whatever execution unit the targeted platform supports. The created task then runs the code indicated by the pel statement. In the example above, it is the functions Producer and Consumer.

Under NERWous C, task creation is asynchronous. A pel statement is a request sent to CHAOS to have a task created. As soon as CHAOS acknowledges this request, the pel statement can return to the caller. Depending on the platform and implementation, CHAOS may create the task right away or schedule the creation for a later time. The benefit of asynchronous task creation is that it optimizes the overall performance of the program, especially if the execution unit of the targeted platform is "heavyweight", requiring substantial resources to set up.

Synchronous task creations are also available by using pel variables.


Pel Code Blocks

In VERSION 1 above, the pelled code is encapsulated in a function. This does not have to be so. Any code block can be tagged with the pel construct <!>, in order to be assigned to a task and run in parallel:
/* VERSION 2 */
main () {
    <mel> int store;

    <!> {    // Inline Producer
        while ( <?>store = Produce() );
    }

    <!> {    // Inline Consumer
        while ( int c = <?>store )
            Consume(c);
    }
}
VERSION 2's inline programming style is suitable mostly for small blocks of code. For complicated programs, the distinct separation offered by functions provides a friendlier way to identify the various tasks.

The creation of a pelled code block is also asynchronous. Once the main task has registered its pel request for the producer code block with the CHAOS runtime, it continues by registering the pel request for the consumer code block. It does not wait for either pelled code block to finish.

Besides stand-alone code blocks, the pel construct can be applied to for and while loops.
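As a sketch of what a pelled loop might look like (hypothetical syntax, extending the <!> tagging of code blocks above; DoWork is a placeholder function), each iteration of the loop body would become its own parallel task:

```
/* Hypothetical sketch: pelling a for loop so that each
   iteration of the loop body runs as its own task */
for (int i = 0; i < 10; i++)
<!> {
    DoWork(i);    // placeholder for per-iteration work
}
```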


Pel Life Cycle

A task comes alive when the CHAOS runtime creates it in response to a pel command <!> request. It runs on its assigned computer element (cel) until it ends. When all the tasks in the program have ended, the NERWous C program exits.

There are several ways for a task to end:
  1. A task runs out of things to do by falling off the last code statement.
     
  2. A task invokes the return or <return> statement, to release an optional value to the calling task before ending itself.
     
  3. A task invokes the <end> statement to end itself with an exception.
     
  4. A task can be terminated by another task. The terminated task can capture this exception and end itself in an orderly manner.
     
  5. A task can be killed by another task. The killed task ends abruptly without the chance for an orderly ending.
     
  6. A task can be aborted due to an unhandled exception or code crash.
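A sketch of endings 2 and 3 (hypothetical, using the <return> and <end> statements named above; Setup and Compute are placeholder functions):

```
<!> {
    if ( !Setup() )
        <end>;              // ending 3: end this task with an exception
    <return> Compute();     // ending 2: release a value to the caller, then end
}
```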


Pel Variables

A pel variable allows the pelling task to monitor and act on its pelled task. Let's modify VERSION 1 of the Producer/Consumer example to make use of a pel variable:
/* VERSION 3 */
main () {
    <mel> int store;
    <pel>prod = <!>Producer (store);
    <!>Consumer (store);
}
void Producer (<mel> int store) {
    while ( <?>store = Produce() );
}
void Consumer (<mel> int store) {
    while ( int c = <?>store )
        Consume(c);
}
A pel variable is introduced with the <pel> construct. It is a local variable residing with the creating parent task. It contains localized information about the child task during creation, execution and ending.

There is a significant difference between VERSION 1 and VERSION 3. In VERSION 1, both tasks are created asynchronously. The main task pels the Producer task and immediately issues the pel request for the Consumer task -- it does not wait to see if the Producer task has been properly created or not. This results in two eventualities:
  1. The Producer task may fail to be created, due to lack of resources or other reasons, without the main task knowing about it.
  2. The Consumer task may be created and run before the Producer task. This could happen if the CHAOS runtime were to assign Consumer to a faster processing unit, or implement a Last-In-First-Out scheduling scheme.
These two eventualities can be resolved by assigning a pel variable to the pel statement to force a synchronous task creation, as seen in VERSION 3.


Synchronous Task Creation

Assigning a pel variable to a pel statement transforms the asynchronous task creation into a synchronous one. This is because the pel variable must be initialized with the result of the task creation. In order to do that, the CHAOS runtime environment has to finish creating the task first to collect the required information.

In VERSION 3, the main task issues a pel request to create Producer and waits for it to be created via the pel variable prod. To know the result of this operation, it checks the status of the pel variable:
/* VERSION 4 */
main () {
    <mel> int store;
    <pel>prod = <!>Producer (store);
    if ( prod<status> == NERW_STATUS_FAILED ) {
        printf ("Producer failed to be created due to %s", prod<why>);
        exit (1);   // exit without invoking Consumer
    }
    <!>Consumer (store);
}
The pel variable prod provides an attachment point for the pel operation to report the result of its attempt to create a task to run the pelled code. In a distributed wide-area-network environment (such as the Internet), the main task has to wait for the remote cel site to report if the task has been successfully created there or not.

Since the creation of the Producer task is synchronous, it is guaranteed to be created before the Consumer task. In this chapter's Producer/Consumer examples, it does not matter if Consumer were to run before Producer because it will wait for the mel store to be valued with a product anyway. However in other cases, the sequence of task creations may be significant, such as in this mel example. In this and similar cases, the use of pel variable assignments is needed.


Pel Operations

A pel operation is a request that a parent task sends to the CHAOS runtime to affect its child task. In NERWous C, a pel operation is specified within the < > markers, and placed before a pel variable representing the created child task. The exception is the pel creation operation where the optional pel variable is the left-hand-side value of an assignment. An operation can contain attributes to fine tune the request.

These are the pel operations and their attributes, using the pel variable p as an example:
Operation       Attributes                  Synopsis
<!>code         at=cel                      Request that code runs on the specified cel
                import=list-of-vars         Transfer the values of the listed global variables to the created task
                import-file=list-of-files   Transfer the values of the global variables in the listed files to the created task
                mode=setting                Request that the task be created with this mode: suspend (create then suspend execution), quickstart (create faster than normal), maintenance (create in maintenance mode for diagnostic messages), running (create and wait for the task to run)
                name=usn                    Assign a unique user-specified name so the task can be referred to without knowing the system-generated task ID
                priority=n                  Request that code runs at the n-th priority level
                timeout=t                   (Synchronous mode only) Request that the task be created within t msec
<start>p                                    Start a suspended task
<suspend>p                                  Suspend a running task
<update>p       name=usn                    Update the local properties of task p; if the name attribute is specified, update p from the task named with the specified user-specified name
<terminate>p                                Request that the task p be terminated
<kill>p                                     Force the task p to end immediately
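For illustration, a sketch of a few of these operations in use (the attribute spellings follow the table above; the particular combination shown is an assumption, not a verified idiom):

```
<pel>p = <! mode=suspend name=prod1 priority=2> Producer (store);
<start>p;          // the task was created suspended; start it now
/* ... later ... */
<terminate>p;      // request an orderly end
```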



Pel Properties

A property of a pel variable is a localized value that the parent task caches from a current or past interaction with its child task. In NERWous C, a property is specified in lower case within the < > markers, and placed after a pel variable.

These are the pel properties of the pel variable p:
Property       Synopsis
p<location>    Cached value of the cel the task runs on
p<priority>    Cached value of the task priority
p<status>      Status of the task from the last update operation
p<why>         Reason the task failed

A pel property value is only valid from the last interaction between the parent task and the child task. For example, the child task may have terminated but its status property at the parent task may still say "running", until the parent task does an update operation or interacts in other ways with the child task.
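A sketch of refreshing a stale property (the NERW_STATUS_ENDED value is assumed here by analogy with the NERW_STATUS_FAILED value used in VERSION 4):

```
<update>p;                             // refresh the cached properties from the child task
if ( p<status> == NERW_STATUS_ENDED )
    printf ("Task ended on cel %s", p<location>);
```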


Pel Exceptions

A pel operation can fail, or the child task can fail. In NERWous C, such a failure generates a pel exception. A pel exception is specified within the < > markers and placed after a pel variable, like a pel property. Unlike pel properties, which are specified in lower case, pel exception names are uppercase.

A programmer can encapsulate a pel operation with a try/catch construct to capture the pel exceptions. These are the pel exceptions:
Exception      Synopsis
p<FAILED>      The task has ended abnormally
p<ENDED>       The task has ended normally
p<TIMEOUT>     The task creation request has timed out

If a pel exception has fired, the pel property <status> will contain the name of the exception:
/* VERSION 5 */
main () {
    <mel> int store;
    try {
        <pel>prod = <!>Producer (store);
    }
    catch (prod<FAILED>) {
        printf ("Producer failed to be created due to %s", prod<why>);
        printf ("The job status [%s] should be FAILED", prod<status>);
        exit (1);   // exit without invoking Consumer
    }
    <!>Consumer (store);
}




Sunday, January 22, 2017

GPU Accelerator Offloading

  1. Current Literature
  2. NERWous C Sample


Current Literature

Recent computer designs pair a general-purpose central processing unit (CPU) with a dedicated graphics processing unit (GPU). Because of its need to render graphics, a GPU crunches numbers at speeds several orders of magnitude above a CPU's. A program can therefore offload the heavy mathematical computation part to the GPU and retain the logical processing part on the CPU.

As an example of this workload division, we will do a matrix addition:
Two matrices must have an equal number of rows and columns to be added. The sum of two matrices A and B will be a matrix which has the same number of rows and columns as do A and B. The sum of A and B, denoted A + B, is computed by adding corresponding elements of A and B.

An example of offloading matrix addition to a GPU is described in "C++ - A Code-Based Introduction to C++ AMP". C++ Accelerated Massive Parallelism (C++ AMP) is Microsoft's solution for writing a program such that portions of it can be compiled and executed on data-parallel hardware. This is the C++ AMP code for matrix addition:
#include <amp.h>                // C++ AMP header file
#include <vector>               // std::vector
#include <algorithm>            // std::generate
using namespace concurrency;    // Save some typing :)
using std::vector;     // Ditto. Comes from <vector> brought in by amp.h

void perform_calculation(
     vector<int>& vA, vector<int>& vB, vector<int>& vC, int M, int N);

int main()
{
  // Rows and columns for matrix
  const int M = 1024;
  const int N = 1024;

  // Create storage for a matrix of above size
  vector<int> vA(M * N);
  vector<int> vB(M * N);

  // Populate matrix objects
  int i = 0;
  std::generate(vA.begin(), vA.end(), [&i](){return i++;});
  std::generate(vB.begin(), vB.end(), [&i](){return i--;});

  // Output storage for matrix calculation
  vector<int> vC(M * N);

  perform_calculation(vA, vB, vC, M, N);
}

void perform_calculation(
     vector<int>& vA, vector<int>& vB, vector<int>& vC, int M, int N)
{
   extent<2> e(M, N);
   array_view<int, 2> a(e, vA), b(e, vB);
   array_view<int, 2> c(e, vC);

   parallel_for_each(e, [=](index<2> idx) restrict(amp)
   {
       c[idx] = a[idx] + b[idx];
   });
}
The description of the C++ AMP specific constructs -- extent, array_view, index and parallel_for_each -- can be read in the reference. Here we are interested in the restrict construct. C++ AMP supports restrict(cpu) for code to be run on the main CPU; this is the default. Code to be offloaded to a hardware accelerator is annotated with restrict(amp). At compile time, the annotated code is checked against the instruction set that such an accelerator can support.


NERWous C Sample

Let's write the matrix addition in NERWous C. The additions will be run in parallel on a cel location called GPU. Since the input matrices, vA and vB, are to be accessed by multiple tasks, they are declared as mel shared variables. And since matrix addition adds the input matrices element by element, vA and vB are declared as mel arrays to facilitate individual element access.
extern <cel> GPU;
#define M 1024
#define N 1024
int main() 
{
  // Create matrix objects in shared storage
  <mel> int vA[M][N];    // input
  <mel> int vB[M][N];    // input
  <mel> int vC[M][N];    // output

  // Populate matrix objects serially
  int acount = 0;
  int bcount = M*N;
  for (int i = 0; i < M; i++)
  {
     for (int j = 0; j < N; j++) 
     {
         <?>vA[i][j] = acount++;
         <?>vB[i][j] = bcount--;
     }
   }
  
  perform_calculation(vA, vB, vC, M, N);
}

void perform_calculation(
     <mel> int vA[][], <mel> int vB[][], <mel> int vC[][], int M, int N)
{
    // Compute additions in parallel on the GPU
    for (int i = 0; i < M; i++) 
    <collect> {
       for (int j = 0; j < N; j++) 
       <! at=GPU> {
          <?>vC[i][j] = <?>vA[i][j] + <?>vB[i][j];
       }
    } <? ENDED>;  /* make sure that all tasks have ended */
}
The population of the matrix objects is done serially to guarantee that the local variables acount and bcount are respectively incremented and decremented in sequence.

In the perform_calculation function, we use two for loops to iterate through all the elements of the matrix objects. The outer for loop runs on the main CPU using whatever thread facility is supported; its only purpose is to launch the inner loops. The inner for loop has all its tasks run on the GPU cel, since they perform the "computationally heavy" arithmetic operations. (The addition operation here is of course not "computationally heavy", but you get the idea.)

The perform_calculation function embeds the for loops inside a collect-ENDED wrapper so that it can wait for all the tasks it has pelled to end before the function returns.

How is the cel GPU defined? It is declared as extern at the beginning of the NERWous program, which means that its definition comes from the configuration file associated with the NERWous program. In a truly "Accelerated Massive Parallelism" environment, this definition will point to the installed GPU, and the NERWous program will be compiled with a corresponding AMP library and a pre-processor that verifies that all the NERWous C code assigned to the cel GPU can be executed on the real GPU.

The programmer can associate a different configuration with the NERWous C program, such as one where the cel GPU is a web site with SOAP access. A different library and pre-processor will be used in this case, but the NERWous C program stays the same.

