GPU computing has transformed industries, enabling advances in applied deep learning for autonomous vehicles, robotics, and molecular biology. The high-speed parallel processing capabilities offered by these machines accelerate the matrix multiplications required to process and transform massive amounts of data, both to train deep learning models and to make predictions (inference) with them. These models are composed of layers of interconnected nodes (neural networks).
Training these neural networks and performing inference faster and cheaper is a high priority in AI research and development. With respect to GPU computing, this means understanding how to better optimize GPU performance.
Familiarity with the following will help with understanding the topics presented in this article:
The goal of this article is to give readers the insight they need to improve their computing experience. Those keen on optimizing GPU performance are advised to learn about the features of the latest GPU architectures, understand the GPU programming language landscape, and gain familiarity with performance monitoring tools like NVIDIA Nsight and SMI. Experimenting, benchmarking, and iterating through GPU optimizations are critical for achieving better utilization of the hardware.
Knowledge of the intricacies of GPU architectures can improve your intuition around programming massively parallel processors. Throughout successive GPU iterations, NVIDIA has introduced a number of specialized hardware features to accelerate its parallel processing capabilities.
By default, many deep learning libraries (e.g., PyTorch) train with single precision (FP32). However, single precision isn’t always necessary for achieving optimal accuracy. Lower-precision values occupy less memory, so more of them can be moved per second for a given memory bandwidth.
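To make the memory savings concrete, here is a minimal sketch of the arithmetic. The byte widths are the standard sizes of each format; the parameter count is a hypothetical figure chosen only for illustration, and the calculation ignores everything other than the stored values themselves (optimizer state, activations, and so on).

```python
# Bytes needed to store one value at each precision (standard format widths).
BYTES_PER_VALUE = {"FP32": 4, "FP16": 2, "BF16": 2, "FP8": 1}


def footprint_gib(num_values: int, dtype: str) -> float:
    """Return the memory needed to store num_values at the given precision, in GiB."""
    return num_values * BYTES_PER_VALUE[dtype] / 2**30


# Hypothetical model size, not taken from any specific model.
num_parameters = 7_000_000_000

for dtype in BYTES_PER_VALUE:
    print(f"{dtype}: {footprint_gib(num_parameters, dtype):.1f} GiB")
```

Halving the width of each value halves the bytes that have to travel between memory and the compute units, which is why lower precision helps workloads that are limited by memory traffic.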
Tensor Cores enable mixed-precision computing, where FP32 is used only when necessary and the lowest-precision data type that doesn’t compromise accuracy is used everywhere else. There are currently five generations of Tensor Cores, with the fourth generation in the Hopper architecture and the fifth generation in the Blackwell architecture.
Architecture (Tensor Core generation) | Data types and features introduced |
---|---|
Volta (first generation) | FP16, FP32 |
Ampere (third generation) | Sparsity, INT8, INT4, FP64, BF16, TF32 |
Hopper (fourth generation) | FP8 |
Blackwell (fifth generation) | FP4 |
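One reason mixed precision keeps some work in FP32 is accumulation error: when many small low-precision values are summed into a low-precision total, the total eventually becomes too coarse to register further additions. The sketch below emulates this on the CPU with Python's `struct` module, which can round a value to IEEE half (`"e"`) or single (`"f"`) precision. It is an illustration of the numerical effect only, not Tensor Core code, and the input value and count are arbitrary.

```python
import struct


def round_to(fmt: str, x: float) -> float:
    """Round a Python float to the nearest value representable in the struct format."""
    return struct.unpack(fmt, struct.pack(fmt, x))[0]


def to_fp16(x: float) -> float:
    return round_to("e", x)  # IEEE 754 half precision


def to_fp32(x: float) -> float:
    return round_to("f", x)  # IEEE 754 single precision


def accumulate(values, rounding):
    """Sum values, rounding the running total to the given precision after each add."""
    total = 0.0
    for v in values:
        total = rounding(total + v)
    return total


# Inputs are stored in FP16 in both cases; only the accumulator precision differs.
values = [to_fp16(0.001)] * 10_000

fp16_sum = accumulate(values, to_fp16)
fp32_sum = accumulate(values, to_fp32)
exact = sum(values)  # Python floats are FP64, used here as the reference

print(f"FP16 accumulator: {fp16_sum}")
print(f"FP32 accumulator: {fp32_sum}")
print(f"FP64 reference:   {exact}")
```

The FP16 accumulator stalls well short of the reference once its spacing between representable values exceeds the size of each addend, while the FP32 accumulator stays close. Keeping inputs in low precision and the accumulator in FP32 is the pattern the first row of the table describes.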
The Transformer Engine is a library that allows for 8-bit floating point (FP8) precision on Hopper GPUs. The introduction of FP8 precision in Hopper GPUs improved performance over FP16 without compromising accuracy. The second-generation Transformer Engine will be in the Blackwell architecture, allowing for FP4 precision.
The Tensor Memory Accelerator (TMA) allows for asynchronous memory transfer between the GPU’s global and shared memory. Prior to the TMA, multiple threads and warps would work together to copy data. In contrast, with the TMA, a single thread in the thread block can issue a TMA instruction for asynchronous handling of the copy operation.
Now, consider this: does hardware design influence the CUDA language? Or does the CUDA language motivate hardware design? Both are true. This relationship between hardware and software is well-described in the 2022 GTC talk How CUDA Programming Works, where Stephen Jones explains that the CUDA language evolved to make the physics of the hardware more programmable.
CUDA (Compute Unified Device Architecture) is a parallel computing platform and programming model for general-purpose computing on GPUs. CUDA supports C, C++, Fortran, and Python among other programming languages.
There are a multitude of libraries built on top of CUDA to extend its functionality. Some notable ones include:
Triton is a Python-based language and compiler for parallel programming. Phil Tillet, the creator of Triton, explains in this video that the language was designed to address the limitations of GPU programming with respect to CUDA and existing domain-specific languages (DSLs).
While highly effective, CUDA is often a bit too complex to just jump into for researchers and practitioners without specialized GPU programming experience. This complexity not only impedes communication between GPU experts and ML researchers, but also hinders the rapid iteration required to accelerate development in compute-intensive fields.
Additionally, existing DSLs are restrictive. They lack support for custom data structures and control over parallelization strategies and resource allocation.
Triton strikes a balance by allowing its users to define and manipulate tensors in SRAM and modify them with the use of torch-like operators, while still providing the flexibility to implement custom parallelization and resource management strategies. Triton helps democratize GPU programming by making it possible to write efficient GPU code without extensive CUDA experience.
GPUs possess multiple memory types, with different sizes and speeds. The inverse relationship between memory size and speed is the basis behind the GPU memory hierarchy. Strategically allocating variables to different CUDA memory types gives developers more control over their program’s performance. The specified memory type impacts the variable’s scope (confined to a single thread, shared within thread blocks, etc.) and the speed at which it’s accessed. Variables stored in high-speed memory like registers or shared memory can be more quickly retrieved than variables stored in slower memory types such as global memory.
FlashAttention is an example of a hardware-aware algorithm that exploits the memory hierarchy.
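A central idea that FlashAttention builds on is processing the attention scores block by block, so that each block fits in fast on-chip memory, while keeping running statistics that let the softmax be computed without ever holding the full row. The sketch below shows that "online softmax" rescaling for a single query in plain Python. It is a simplified CPU illustration of the numerical trick, not the FlashAttention kernel; the function names, scalar values, and block size are all assumptions made for the example.

```python
import math


def full_attention(scores, values):
    """Reference: softmax over the whole row at once, then a weighted sum."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return sum(e / total * v for e, v in zip(exps, values))


def blockwise_attention(scores, values, block_size):
    """Same result, but only one block of scores/values is examined at a time."""
    running_max = -math.inf  # largest score seen so far
    running_sum = 0.0        # softmax normalizer, relative to running_max
    weighted = 0.0           # unnormalized output, relative to running_max
    for start in range(0, len(scores), block_size):
        s_block = scores[start:start + block_size]
        v_block = values[start:start + block_size]
        new_max = max(running_max, max(s_block))
        # Rescale what has been accumulated so far to the new maximum.
        scale = math.exp(running_max - new_max)
        exps = [math.exp(s - new_max) for s in s_block]
        running_sum = running_sum * scale + sum(exps)
        weighted = weighted * scale + sum(e * v for e, v in zip(exps, v_block))
        running_max = new_max
    return weighted / running_sum


scores = [0.5, 2.0, -1.0, 3.5, 0.0, 1.25, 4.0]
values = [1.0, -2.0, 0.5, 3.0, 2.5, -1.0, 0.75]

print(full_attention(scores, values))
print(blockwise_attention(scores, values, block_size=3))
```

Both functions agree up to floating-point rounding. The block-wise version only ever needs one block plus three running scalars, which is what makes it possible to keep the working set in small, fast memory instead of repeatedly reading and writing the slower global memory.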
Performance evaluation in GPU computing depends on the intended use-case. That being said, key metrics used to assess overall efficiency include latency and throughput.
Latency refers to the time delay between request and response. In the context of our favourite parallel processor, request is when the GPU receives a command for processing and response is when processing is complete and the result is returned.
Throughput is the number of units the GPU processes per second. This metric reflects the GPU’s processing capacity to handle multiple tasks in parallel. GPU architects and developers strive to minimize latency and maximize throughput.
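The two metrics can be measured with the same timing loop. Below is a minimal sketch using only the Python standard library; `measure` and `square_all` are hypothetical names, and the workload is a CPU stand-in rather than real GPU code.

```python
import time


def measure(fn, batch, repeats=20):
    """Return (mean latency in seconds per call, throughput in items per second)."""
    fn(batch)  # warm-up call so one-time setup cost is excluded from the timing
    start = time.perf_counter()
    for _ in range(repeats):
        fn(batch)
    elapsed = time.perf_counter() - start
    latency = elapsed / repeats
    throughput = len(batch) * repeats / elapsed
    return latency, throughput


def square_all(batch):
    """Hypothetical stand-in workload."""
    return [x * x for x in batch]


for batch_size in (10, 1_000, 100_000):
    latency, throughput = measure(square_all, list(range(batch_size)))
    print(f"batch={batch_size}: {latency * 1e3:.3f} ms/call, {throughput:,.0f} items/s")
```

When timing real GPU work, keep in mind that kernel launches are asynchronous, so the device has to be synchronized before the timer is stopped (in PyTorch, for example, with `torch.cuda.synchronize()`); otherwise the loop measures launch overhead rather than execution time.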
These metrics are often looked at when benchmarking GPUs. For instance, the study, Benchmarking and Dissecting the Nvidia Hopper GPU Architecture, benchmarks Hopper GPUs with latency and throughput tests for different memory units, Tensor Cores, and new CUDA programming features introduced with Hopper (DPX, asynchronous data movement, and distributed shared memory).
As Stephen Jones notes in his 2022 GTC talk, How CUDA Programming Works, floating-point operations per second (FLOPS) are often cited as a performance measure, but they’re rarely the limiting factor. GPUs typically have an abundance of floating-point computational power, and therefore other aspects like memory bandwidth prove to be more significant bottlenecks.
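A common way to reason about this is the roofline model: compare a kernel's arithmetic intensity (FLOPs performed per byte moved) with the hardware's ratio of peak compute to peak memory bandwidth. The sketch below works through that arithmetic. The peak figures are hypothetical round numbers, not the specification of any real GPU, and the byte counts assume each operand is read or written exactly once.

```python
def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    """FLOPs performed per byte transferred to or from memory."""
    return flops / bytes_moved


def classify(flops, bytes_moved, peak_flops, peak_bandwidth):
    """Label a kernel memory-bound or compute-bound under the roofline model."""
    machine_balance = peak_flops / peak_bandwidth  # FLOPs the GPU can do per byte moved
    if arithmetic_intensity(flops, bytes_moved) < machine_balance:
        return "memory-bound"
    return "compute-bound"


# Hypothetical hardware: 50 TFLOPS peak compute, 2 TB/s peak memory bandwidth.
PEAK_FLOPS = 50e12
PEAK_BANDWIDTH = 2e12

n = 4096
fp32_bytes = 4

# Vector add c = a + b: n FLOPs, two reads and one write per element.
vector_add = classify(n, 3 * n * fp32_bytes, PEAK_FLOPS, PEAK_BANDWIDTH)

# n x n matrix multiply: 2 * n^3 FLOPs, three n x n matrices moved once each.
matmul = classify(2 * n**3, 3 * n * n * fp32_bytes, PEAK_FLOPS, PEAK_BANDWIDTH)

print("vector add:", vector_add)
print("matmul:    ", matmul)
```

Under these assumptions the vector add performs far fewer FLOPs per byte than the hardware can sustain, so extra compute capacity would not speed it up, whereas the large matrix multiply reuses each loaded value many times and can keep the arithmetic units busy.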
GPU performance monitoring allows developers and system administrators to identify bottlenecks (is the job memory-bound, latency-bound, or compute-bound?), effectively allocate GPU resources, prevent overheating, manage power consumption, and make informed decisions about hardware upgrades. NVIDIA provides two powerful tools for GPU monitoring: Nsight and SMI.
NVIDIA Nsight Systems is a system-wide performance analysis tool that allows visualization of an application’s algorithm and identification of areas for optimization. Additional information on NVIDIA Nsight Compute can be found in the kernel profiling guide.
The NVIDIA System Management Interface (nvidia-smi) is a command-line tool, built on the NVIDIA Management Library, for managing and monitoring GPU devices. Additional information can be found in the nvidia-smi documentation.
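Beyond the interactive view, nvidia-smi can emit selected fields as CSV, which makes it easy to poll from a script. The sketch below wraps that in Python. The helper names are hypothetical, the choice of fields is just one reasonable selection, and the function returns an empty list when nvidia-smi is not installed or fails. The parser assumes GPU names contain no commas.

```python
import shutil
import subprocess

# Fields requested from nvidia-smi; all are standard --query-gpu properties.
QUERY_FIELDS = [
    "name",
    "utilization.gpu",
    "memory.used",
    "memory.total",
    "temperature.gpu",
]


def parse_csv_line(line: str) -> dict:
    """Turn one CSV line of nvidia-smi output into a field -> value mapping."""
    values = [v.strip() for v in line.split(",")]
    return dict(zip(QUERY_FIELDS, values))


def query_gpus() -> list:
    """Return one dict per GPU, or an empty list if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        [
            "nvidia-smi",
            f"--query-gpu={','.join(QUERY_FIELDS)}",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return []
    return [parse_csv_line(line) for line in result.stdout.splitlines() if line.strip()]


for gpu in query_gpus():
    print(gpu)
```

Calling `query_gpus()` in a loop with a short sleep gives a lightweight way to log utilization, memory use, and temperature over the course of a training run.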
(Figure: example nvidia-smi output.)
This article is by no means a comprehensive treatment of GPU optimization, but rather an introduction to the topic. We encourage you to explore the links sprinkled throughout the article and in the references section to improve your understanding. More articles to come!
Thanks for learning with the DigitalOcean Community. Check out our offerings for compute, storage, networking, and managed databases.