Introduction to GPU programming

29.08.2013

Eirik Ola Aksnes

Agenda

  • What are GPUs?
  • Where can GPUs be found?
  • Historical motivation - the video game industry
  • GPGPU - general-purpose computing on GPUs
  • CPU vs. GPU
  • Data parallelism
  • Nvidia CUDA
  • CUDA example - Vector addition
  • Final comments

What are GPUs?

  • GPUs are highly parallel, multithreaded, many-core processors
  • Hundreds of cores
  • Thousands of concurrent threads
  • First GPU - Nvidia's GeForce 256 (1999)
  • Vendors - NVIDIA, AMD, Intel

Where can GPUs be found?

  • Mobile phones
  • Personal computers
  • Clusters
  • Supercomputers
  • Game consoles

Historical motivation - the video game industry

The video game industry (a multi-billion dollar industry!) constantly pushes to improve the ability to perform massive numbers of floating-point calculations.

1981 - 3D Monster Maze

2013 - Tom Clancy’s The Division

GPGPU - General-purpose computing on GPUs

  • GPUs can be used for more than just graphics rendering
  • GPUs can accelerate applications in a variety of disciplines
    • Big speedup compared to CPUs in some cases
  • In the beginning, graphics APIs such as OpenGL were used to do GPGPU (a big hack)
    • Difficult to develop, debug, and optimize
  • Improvements in hardware and software have made GPGPU easier
    • Programmability

History of GPGPU computing

CPU vs. GPU

Floating-point performance

Memory bandwidth

Design

  • Cores: CPU - few complex and speedy cores; GPU - many simple and slow cores
  • Control logic: CPU - each core has its own sophisticated control logic (independent execution); GPU - groups of compute cores share control logic
  • Memory latency: CPU - hidden by caches and prefetching; GPU - hidden by swapping threads (massive multithreading, very fast context switching)

Transistors

  • CPU: Uses large fraction of the chip area for control logic and cache
  • GPU: Uses most of the chip area for data processing units

Latency vs. Throughput

What is better? What do you need?

  • CPU - low latency: e.g. deliver a package as soon as possible
  • GPU - high throughput: e.g. deliver many packages within a reasonable time

Data parallelism

  • Multiple processing units perform the same operation on different data elements in parallel
  • Flynn's taxonomy: Single-Instruction Multiple-Data (SIMD)
  • NVIDIA: Single-Instruction Multiple-Threads (SIMT)

Data parallelism - Image example
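
As an illustrative sketch (a hypothetical CUDA kernel; CUDA itself is introduced on the next slides): brightening a grayscale image applies the same operation to every pixel independently, so one thread can process one pixel.

// Hypothetical kernel: one thread per pixel, every thread runs the same operation
__global__ void brighten(unsigned char *pixels, int width, int height, int delta) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // pixel column
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // pixel row
    if (x < width && y < height) {
        int i = y * width + x;
        int v = pixels[i] + delta;
        pixels[i] = (v > 255) ? 255 : (unsigned char) v;  // clamp to the 8-bit range
    }
}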

What is CUDA?

  • Compute Unified Device Architecture
  • Nvidia introduced CUDA in 2006
  • Specially designed for GPGPU
  • Only supported by Nvidia graphics cards
  • Write code in C/C++; bindings exist for Java, Python, Fortran, Perl...
  • You do not need parallel programming experience
  • Requires no knowledge of graphics APIs
  • Access to native instructions and memory

Get started - CUDA Toolkit

  • Driver
  • Compiler (nvcc) - see the compile example after this list
  • Development, profiling and debugging tools
  • Various libraries
  • Programming guides and API reference
  • Example codes
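
For example, a single-file CUDA program (file name hypothetical) can be compiled with nvcc and run like this:

$ nvcc vectorAdd.cu -o vectorAdd
$ ./vectorAdd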

Get started

CUDA C Programming Guide

The advent of multicore CPUs and many-core GPUs means that mainstream processor chips are now parallel systems. Furthermore, their parallelism continues to scale with Moore’s law. The challenge is to develop application software that transparently scales its parallelism to leverage the increasing number of processor cores...

At its core are three key abstractions – a hierarchy of thread groups, shared memories, and barrier synchronization...

...data parallelism...

CUDA terminology

Heterogeneous programming

  • Serial code → Host (CPU)
  • Parallel code → Device (GPU)

Kernels

C functions that, when called, are executed N times in parallel by N different threads on the device, as opposed to only once like a regular C function.
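
A minimal sketch (the kernel name matches the one used in the execution configuration example below): a kernel is an ordinary-looking C function marked with the __global__ qualifier and launched from host code with the <<<...>>> syntax.

__global__ void myKernel(void) {
    // Device code: this body is executed by every one of the N threads
}

// Host code, e.g. 4 blocks of 64 threads = 256 parallel executions of the body:
// myKernel<<<4, 64>>>();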

Thread hierarchy

  • A kernel is executed by thousands of threads in parallel, organized as a hierarchy.

Thread hierarchy - Blocks

  • Threads are grouped into blocks (1D, 2D or 3D)
    • Each thread has its own local ID within the block

Thread hierarchy - Grids

  • Blocks are grouped into a grid (1D, 2D or 3D)
    • Each block has its own ID
    • All blocks in a grid have the same dimension

Thread hierarchy - Scalability

  • Threads in the same block can
    • Be synchronized
    • Share data (shared memory)
  • Threads in different blocks cannot cooperate
    • Thread blocks are independent
    • Can be executed in any order -> scalability

How to launch a kernel?

Execution configuration example


dim3 dimGrid(2, 2); // 4 blocks (2D)
dim3 dimBlock(2, 4); // 8 threads per block (2D)
myKernel<<<dimGrid, dimBlock>>>();
            

dim3 is a structure with three fields: x, y, and z; any field left unspecified defaults to 1.

How to determine unique thread IDs?

With this execution configuration


dim3 dimGrid(3); // 3 blocks in 1D
dim3 dimBlock(5); // 5 threads per block in 1D
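
A common convention (a sketch; the vector addition kernel later in these slides uses the same formula) is to combine the block ID, the block dimension, and the thread's local ID:

// Inside the kernel
int uniqueThreadId = blockIdx.x * blockDim.x + threadIdx.x;

// With 3 blocks of 5 threads this yields the IDs 0..14:
//   block 0 -> 0 1 2 3 4, block 1 -> 5 6 7 8 9, block 2 -> 10 11 12 13 14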
            

How are kernels executed?

  • The GPU core is the stream processor (SP)
    • Able to run a single sequential thread
  • SPs are grouped into streaming multiprocessors (SMs)
    • Can execute hundreds of threads concurrently
    • An SM is basically a SIMD processor
  • There are multiple SMs per GPU

NVIDIA's consumer graphics cards

How are kernels executed - Grid

  • Grid → GPU
    • An entire grid is handled by a single GPU chip
    • The GPU is responsible for allocating blocks to SMs that have available capacity

How are kernels executed - Block

  • Block → Streaming multiprocessor
    • A block is never divided across multiple streaming multiprocessors

How are kernels executed - Thread

  • Thread → Stream processor
    • Each stream processor handles one or more threads in a block

Synchronization

  • Point in the program where threads stop and wait
  • When all threads have reached the barrier, they can proceed

Inside kernel


    __syncthreads(); // Synchronize threads within a block
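
For example, in this hypothetical kernel each thread writes one element into shared memory and then reads its neighbour's element; the barrier guarantees that no thread reads a slot before it has been written (assumes at most 256 threads per block):

__global__ void shiftLeft(float *in, float *out, int n) {
    __shared__ float tile[256];                 // shared by the threads of one block
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    if (i < n) {
        tile[threadIdx.x] = in[i];              // each thread writes one element
    }

    __syncthreads();                            // wait until the whole block has written

    if (i + 1 < n && threadIdx.x + 1 < blockDim.x) {
        out[i] = tile[threadIdx.x + 1];         // now safe to read a neighbour's element
    }
}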

Memory hierarchy

Memory management

Inside kernel


float variable; // Local memory (registers)
__shared__ float variable; // Shared memory
__device__ float variable; // Global memory
            

Memory management

  • Allocate GPU memory:
    • cudaMalloc(......)
  • Copy data to/from GPU:
    • cudaMemcpy(......, cudaMemcpyHostToDevice)
    • cudaMemcpy(......, cudaMemcpyDeviceToHost)
  • Free GPU memory:
    • cudaFree(......)
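
A minimal host-side sketch of this pattern (the helper name is hypothetical, and the usual stdio.h / cuda_runtime.h includes are assumed); checking the returned cudaError_t is good practice:

// Hypothetical helper: allocate, copy in, (run a kernel), copy out, free
void copyRoundTrip(float *h_data, int n) {
    size_t bytes = n * sizeof(float);

    // Allocate GPU memory (every CUDA runtime call returns a cudaError_t)
    float *d_data = NULL;
    cudaError_t err = cudaMalloc((void **) &d_data, bytes);
    if (err != cudaSuccess) {
        printf("cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return;
    }

    // Copy data from the CPU to the GPU
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);

    // ... launch a kernel that works on d_data ...

    // Copy the (modified) data back from the GPU to the CPU
    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);

    // Free GPU memory
    cudaFree(d_data);
}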

CUDA C Programming Guide...

At its core are three key abstractions – a hierarchy of thread groups, shared memories, and barrier synchronization...

CUDA example - Vector addition

Typical program execution

  • Allocate memory and initialize data on CPU
  • Allocate memory on GPU
  • Transfer data from CPU to GPU
    • A slow operation - aim to minimize it!
  • Launch kernel
  • Transfer results back from GPU to CPU
  • Free CPU and GPU memory

The problem to solve

C = A + B

Vector addition on CPU


for(int i=0; i<N; i++) {
    C[i] = A[i] + B[i];
}
  • Only one thread of execution
  • No explicit parallelism

Allocate memory on CPU


int main() {
    // Size of vectors
    int n = 50000;

    // Size, in bytes, of each vector
    size_t bytes = n * sizeof(float);

    // Allocate memory for the host vectors
    float *h_A = (float *) malloc(bytes);
    float *h_B = (float *) malloc(bytes);
    float *h_C = (float *) malloc(bytes);

Initialize vectors with some values


    for(int i=0; i<n; ++i) {
      h_A[i] = 1;
      h_B[i] = 3;
    }

Allocate memory on GPU


    // Allocate memory for the device vectors
    float *d_A = NULL;
    cudaMalloc((void **) &d_A, bytes);
    float *d_B = NULL;
    cudaMalloc((void **) &d_B, bytes);
    float *d_C = NULL;
    cudaMalloc((void **) &d_C, bytes);

Transfer data from CPU to GPU


    // Copy the host vectors to the device
    cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, bytes, cudaMemcpyHostToDevice);

Launch kernel


    // Execution configuration
    int threadsPerBlock = 256;
    int blocksPerGrid =(n + threadsPerBlock - 1) / threadsPerBlock;

    // Launch kernel
    vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, n);

    printf("Kernel launch with %d blocks of %d threads\n",
           blocksPerGrid, threadsPerBlock);

Transfer results back from GPU to CPU


    // Copy the result back to the host result vector
    cudaMemcpy(h_C, d_C, bytes, cudaMemcpyDeviceToHost);

Show the result


    // Sum up the host result vector and print the sum divided by n;
    // this should equal 4
    float sum = 0;
    for(int i=0; i<n; i++) {
        sum += h_C[i];
    }
    sum = sum / n;
    printf("Final result: %f\n", sum);

Free CPU and GPU memory


    // Free device global memory
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);

    // Free host memory
    free(h_A);
    free(h_B);
    free(h_C);

    return 0;
}

The kernel

The for-loop is removed!


#include <stdio.h>
#include <cuda_runtime.h>

__global__ void vectorAdd(float *d_A, float *d_B, float *d_C, int n) {
    // Calculate the thread ID
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Make sure we do not go out of bounds
    if (i < n) {
        d_C[i] = d_A[i] + d_B[i];
    }
}

Visualized

Example: Vector addition on GPU


$ Kernel launch with 196 blocks of 256 threads
$ Final result: 4.000000

Execution time CPU vs. GPU - Assume we do 64 additions

  • The CPU uses 2 ns for 1 addition. CPU execution time = number of additions * time for 1 addition = 64 * 2 ns = 128 ns
  • The GPU uses 10 ns for 1 addition. GPU execution time = time for 1 addition (assuming we have enough resources to do all the additions in parallel) = 10 ns
  • In this idealized case the GPU is 128 ns / 10 ns = 12.8 times faster, even though each individual addition is slower

Final comments

Performance guidelines

  • High arithmetic intensity: Math (maximize) / memory (minimize)
  • High number of threads: Keep the GPU busy. Need enough threads to hide memory latency
  • Avoid thread divergence (caused by if, switch, do, for, while statements): Different execution paths -> serialisation (SIMT). See the sketch after this list
  • Large data sets: Plenty to operate on in parallel
  • Minimize CPU-GPU memory transfers: Slow operation
  • Use single precision if possible
  • ...
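
A hypothetical sketch of thread divergence: neighbouring threads take different branches, so within each warp the two paths are executed one after the other.

__global__ void divergentKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (i % 2 == 0) {
            data[i] = data[i] * 2.0f;   // half of each warp takes this path...
        } else {
            data[i] = data[i] + 1.0f;   // ...and idles while the other half takes this one
        }
    }
}

Branching on a per-warp quantity instead, for example (i / warpSize) % 2, lets all threads of a warp follow the same path.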

Why use GPUs?

  • Computing power: it is powerful
    • Number crunching: 1 GPU ~= 4 TFLOPS ~= small cluster
  • It is cheap commodity hardware
    • FLOPS per NOK
  • It is everywhere
    • Hundreds of millions of programmable GPUs have been sold
  • Power efficiency
    • FLOPS per Watt

NVIDIA - Tesla cards

  • No graphics output! What?
  • Aimed at scientific computing rather than gaming
  • Tested and burned-in for long-running calculations

GPGPU - Areas

  • Graphics
  • Multimedia
  • Ultrasound Imaging
  • Molecular Dynamics
  • Seismic Imaging
  • Astrophysics
  • Data Mining
  • Finance
  • Physics
  • Chemistry
  • ...

More Responsibility

  • GPU programming puts more responsibility on the programmer than "regular" programming
  • For example, the programmer must decide and explicitly code:
    • How to partition the computation into a grid, blocks, and threads
    • Where (in what kind of memory) to put each piece of data for best performance

The end!

  • CPU + GPU = combination of flexibility and performance
  • GPUs can make your simulations run faster, so you cannot slack off...
  • Getting started with GPU programming is easy, being able to fully utilize GPU hardware is hard...
  • CPU optimizations do not necessarily apply to GPUs
  • Memory movement is very expensive

References