Write code in C/C++, Java, Python, Fortran, Perl...
You do not need parallel programming experience
Requires no knowledge of graphics APIs
Access to native instructions and memory
Get started - CUDA Toolkit
Driver
Compiler nvcc
Development, profiling and debugging tools
Various libraries
Programming guides and API reference
Example codes
Get started
CUDA C Programming Guide
The advent of multicore CPUs and many-core GPUs means that mainstream processor chips are now parallel systems. Furthermore, their parallelism continues to scale with Moore's law. The challenge is to develop application software that transparently scales its parallelism to leverage the increasing number of processor cores...
At its core are three key abstractions – a hierarchy of thread groups, shared memories, and barrier synchronization...
...data parallelism...
CUDA terminology
Heterogeneous programming
Serial code → Host
Parallel code → Device
Kernels
C functions that, when called, are executed N times in parallel by N different threads on the device, as opposed to only once like a regular C function.
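A minimal sketch of what a kernel looks like (the kernel name and arguments are made up for illustration; the thread indexing is explained below):
__global__ void setToOne(float *data, int n) {
    // Each of the N threads writes one element
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] = 1.0f;
    }
}
// Called from host code with an execution configuration:
// setToOne<<<numBlocks, threadsPerBlock>>>(d_data, n);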
Thread hierarchy
A kernel is executed by thousands of threads in parallel, organized as a hierarchy.
dim3 is a structure with three members: x, y and z.
How to determine unique thread IDs?
With this execution configuration
dim3 dimGrid(3); // 3 blocks in 1D
dim3 dimBlock (5); // 5 threads per block in 1D
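With that configuration there are 3 * 5 = 15 threads in total. A sketch of the usual 1D global thread ID computation (the same expression is used in the vectorAdd kernel later):
int id = blockIdx.x * blockDim.x + threadIdx.x;
// With 3 blocks of 5 threads, id ranges from 0 to 14:
//   block 0: threadIdx.x 0..4 -> id 0..4
//   block 1: threadIdx.x 0..4 -> id 5..9
//   block 2: threadIdx.x 0..4 -> id 10..14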
How are kernels executed?
The GPU core is the stream processor (SP)
Able to run a single sequential thread
SPs are grouped into streaming multiprocessors (SMs)
Can execute hundreds of threads concurrently
An SM is basically a SIMD processor
There are multiple SMs per GPU
NVIDIA's consumer graphics cards
How are kernels executed - Grid
Grid → GPU
An entire grid is handled by a single GPU chip
The GPU is responsible for allocating blocks to SMs that have available capacity
How are kernels executed - Block
Block → Streaming multiprocessor
A block is never divided across multiple streaming multiprocessors
How are kernels executed - Thread
Thread → Stream processor
Each stream processor handles one or more threads in a block
Synchronization
Point in the program where threads stop and wait
When all threads have reached the barrier, they can proceed
Inside kernel
__syncthreads() // Synchronize threads within a block
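An illustrative use of __syncthreads() (the kernel is made up for this example and assumes it is launched with at most 256 threads per block): each thread stages a value in shared memory, and the barrier guarantees all writes are done before any thread reads another thread's slot.
__global__ void reverseBlock(float *data) {
    __shared__ float tmp[256];                    // one block's worth of data
    int t = threadIdx.x;
    int base = blockIdx.x * blockDim.x;
    tmp[t] = data[base + t];                      // every thread writes one slot
    __syncthreads();                              // wait for all writes in the block
    data[base + t] = tmp[blockDim.x - 1 - t];     // now safe to read another thread's slot
}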
Memory hierarchy
Memory management
Inside kernel
float variable; // Local memory (registers)
__shared__ float variable; // Shared memory
At file scope (outside any function)
__device__ float variable; // Global memory
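A small sketch showing where each declaration goes (the variable and kernel names are illustrative; assumes the kernel is launched with at most 128 threads per block):
__device__ float d_scale = 2.0f;                  // global memory, declared at file scope

__global__ void scaleAndShare(float *out) {
    __shared__ float buf[128];                    // shared memory, one copy per block
    float x = d_scale * threadIdx.x;              // local variable, normally kept in registers
    buf[threadIdx.x] = x;
    __syncthreads();
    out[blockIdx.x * blockDim.x + threadIdx.x] = buf[threadIdx.x];
}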
Memory management
Allocate GPU memory:
cudaMalloc(......)
Copy data to/from GPU:
cudaMemcpy(......, cudaMemcpyHostToDevice)
cudaMemcpy(......, cudaMemcpyDeviceToHost)
Free GPU memory:
cudaFree(......)
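A sketch of the typical call sequence with illustrative names (h_A is a host array, d_A its device copy, bytes the allocation size in bytes):
float *d_A;
cudaMalloc((void **) &d_A, bytes);                       // allocate on the GPU
cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice);     // host -> device
// ... launch a kernel that works on d_A ...
cudaMemcpy(h_A, d_A, bytes, cudaMemcpyDeviceToHost);     // device -> host
cudaFree(d_A);                                           // release the GPU memory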
CUDA C Programming Guide...
At its core are three key abstractions – a hierarchy of thread groups, shared memories, and barrier synchronization...
CUDA example - Vector addition
Typical program execution
Allocate memory and initialize data on CPU
Allocate memory on GPU
Transfer data from CPU to GPU
A slow operation, aim to minimize this!
Launch kernel
Transfer results back from GPU to CPU
Free CPU and GPU memory
The problem to solve
C = A + B
Vector addition on CPU
for(int i=0; i<N; i++) {
C[i] = A[i] + B[i];
}
Only one thread of execution
No explicit parallelism
Allocate memory on CPU
int main() {
// Size of vectors
int n = 50000;
// Size, in bytes, of each vector
size_t bytes = n * sizeof(float);
// Allocate memory for the host vectors
float *h_A = (float *) malloc(bytes);
float *h_B = (float *) malloc(bytes);
float *h_C = (float *) malloc(bytes);
// (initialization of h_A and h_B is omitted here)
Allocate memory on GPU and transfer data
// Allocate memory for the device vectors
float *d_A, *d_B, *d_C;
cudaMalloc((void **) &d_A, bytes);
cudaMalloc((void **) &d_B, bytes);
cudaMalloc((void **) &d_C, bytes);
// Copy the host vectors to the device
cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice);
cudaMemcpy(d_B, h_B, bytes, cudaMemcpyHostToDevice);
Launch kernel
// Execution configuration
int threadsPerBlock = 256;
int blocksPerGrid =(n + threadsPerBlock - 1) / threadsPerBlock;
// Launch kernel
vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, n);
printf("Kernel launch with %d blocks of %d threads\n",
blocksPerGrid, threadsPerBlock);
Transfer results back from GPU to CPU
// Copy the result back to the host result vector
cudaMemcpy(h_C, d_C, bytes, cudaMemcpyDeviceToHost);
Show the result
// Sum up the host result vector and print the result divided by n;
// this should equal 4
float sum = 0;
for(int i=0; i<n; i++) {
sum += h_C[i];
}
sum = sum / n;
printf("Final result: %f\n", sum);
#include <stdio.h>
#include <cuda_runtime.h>
__global__ void vectorAdd(float *d_A, float *d_B, float *d_C, int n) {
// Calculate the thread ID
int i = blockIdx.x * blockDim.x + threadIdx.x;
// Make sure we do not go out of bounds
if (i < n) {
d_C[i] = d_A[i] + d_B[i];
}
}
Visualized
Example: Vector addition on GPU
$ Kernel launch with 196 blocks of 256 threads
$ Final result: 4.000000
Execution time CPU vs. GPU - Assume we do 64 additions
CPU uses 2 ns for 1 addition. CPU execution time = number of additions * time it takes for 1 addition = 64 * 2 ns = 128 ns
GPU uses 10 ns for 1 addition. GPU execution time = the time it takes for 1 addition (we assume we have enough resources to do all the additions in parallel) = 10 ns
Final comments
Performance guidelines
High arithmetic intensity: Math (maximize) / memory (minimize)
High number of threads: Keep the GPU busy. Need enough threads to hide memory latency
Avoid thread divergence (caused by if, switch, do, for, while statements): different execution paths within a warp are serialised (SIMT), see the sketch after this list
Large data sets: Plenty to operate on in parallel
Minimize CPU-GPU memory transfers: Slow operation
Use single precision if possible
...
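A sketch of thread divergence (the kernel is made up for illustration): threads in the same warp take different branches, so the hardware runs the two branches one after the other.
__global__ void divergent(float *data) {
    // (bounds check omitted for brevity)
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i % 2 == 0) {
        data[i] = data[i] * 2.0f;     // even-numbered threads take this path
    } else {
        data[i] = data[i] + 1.0f;     // odd-numbered threads take this path
    }
}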
Why use GPUs?
Computing power: it is powerful
Number crunching: 1 GPU ~= 4 TFLOPS ~= Small cluster
It is cheap commodity hardware
FLOPS per NOK
It is everywhere
Hundreds of millions of programmable GPUs sold
Power
FLOPS per Watt
NVIDIA - Tesla cards
No graphics output! What?
Aimed at scientific computing rather than gaming
Tested and burned-in for long-running calculations
GPGPU - Areas
Graphics
Multimedia
Ultrasound Imaging
Molecular Dynamics
Seismic Imaging
Astrophysics
Data Mining
Finance
Physics
Chemistry
...
More Responsibility
GPU programming puts more responsibility on the programmer than "regular" programming
For example, the programmer must decide and explicitly code:
How to partition the computation into a grid, blocks, and threads
Where (in what kind of memory) to put each piece of data for best performance
The end!
CPU + GPU = combination of flexibility and performance
GPUs can make your simulations run faster, so you cannot slack off...
Getting started with GPU programming is easy; being able to fully utilize the GPU hardware is hard...