An Introduction to GPU Performance Optimization for Deep Learning

Published on September 25, 2024

An Introduction to GPU Performance Optimization for Deep Learning

The Role of GPUs in Deep Learning

GPU computing has transformed industries, enabling applied deep learning advancements in autonomous vehicles, robotics, and molecular biology. The high-speed parallel processing capabilities offered by these machines accelerate the matrix multiplication computations required for processing and transforming massive amounts of data to train and make predictions (inference) with deep learning models comprised of layers of interconnected nodes (neural networks).

Training these neural networks and performing inference faster and cheaper is a high priority in AI research and development. With respect to GPU computing, this means understanding how to better optimize GPU performance.

Prerequisites

Familiarity with the following will help with understanding the topics presented in this article:

Machine learning (ML) & deep learning basics (ex:matrix multiplication, neural networks, Python, PyTorch)
Data types (INT, FP, etc.)
Recent NVIDIA GPU architectures: Blackwell (announced, but not yet available), Hopper (2022), Ampere (2020)
CUDA and the GPU memory hierarchy

Introduction to GPU Optimization

The goal of this article is to give readers the insight they need to improve their computing experience. Those keen on optimizing GPU performance are advised to learn about the features of the latest GPU architectures, understand the GPU programming language landscape, and gain familiarity with performance monitoring tools like NVIDIA Nsight and SMI. Experimenting, benchmarking, and iterating through GPU optimizations are critical for achieving better utilization of the hardware.

Harnessing the Hardware Features of NVIDIA GPUs

Knowledge of the intricacies of GPU architectures can improve your intuition around programming massively parallel processors. Throughout successive GPU iterations, NVIDIA has introduced a number of specialized hardware features to accelerate its parallel processing capabilities.

Tensor Cores

By default, many deep learning libraries (ex: PyTorch) train with single precision (FP32). However, single precision isn’t always necessary for achieving optimal accuracy. Lower precision requires less memory, increasing the speed at which data can be accessed (memory bandwidth).

Tensor Cores enable mixed-precision computing, where FP32 is used only when necessary and the lowest precision data type that doesn’t compromise accuracy is used. There are currently five generations of Tensor Cores, with the fourth generation in the Hopper architecture and the fifth generation in the Blackwell architecture.

Tensor Core	Data Type introduced
Volta (first generation)	FP16, FP32
Ampere (third generation)	Sparsity, INT8, INT4, FP64, BF16, TF32
Hopper (fourth generation)	FP8
Blackwell (fifth generation)	FP4

Transformer Engine

The Transformer Engine is a library that allows for 8-bit floating point (FP8) precision on Hopper GPUs. The introduction of FP8 precision in Hopper GPUs improved performance over FP16 without compromising accuracy. The second-generation Transformer Engine will be in the Blackwell architecture, allowing for FP4 precision.

Tensor Memory Accelerator

The Tensor Memory Accelerator (TMA) allows for asynchronous memory transfer between the GPU’s global and shared memory. Prior to the TMA, multiple threads and warps would work together to copy data. In contrast, with the TMA, a single thread in the thread block can issue a TMA instruction for asynchronous handling of the copy operation.

GPUs Are Programmable

Now, consider this: does hardware design influence the CUDA language? Or does the CUDA language motivate hardware design? Both are true. This relationship between hardware and software is well-described in the 2022 GTC talk, How CUDA Programming Works where Stephen Jones explains that the CUDA language evolved to make the physics of the hardware more programmable.

CUDA

CUDA (Compute Unified Device Architecture) is a parallel computing platform designed to configure GPUs. CUDA supports C, C++, Fortran, and Python among other programming languages.

CUDA Libraries

There are a multitude of libraries built on top of CUDA to extend its functionality. Some notable ones include:

cuBLAS: a GPU-accelerated basic linear algebra (BLAS) library capable of accelerating low and mixed precision matrix multiplication.
cuDNN (CUDA Deep Neural Network): a library that provides implementations of operations that appear frequently in DNN applications such as convolution, attention, matrix multiplication, pooling, tensor transformation functions, etc.
CUTLASS (CUDA Templates for Linear Algebra Subroutines): a library that supports mixed-precision computing with optimized operations for various data types, including floating-point (FP16 to FP64), integer (4/8-bit), and binary (1-bit). It utilizes NVIDIA’s Tensor Cores for high-throughput matrix multiplication.
CuTe (CUDA Templates): a header-only C++ library that offers Layout and Tensor templates. These abstractions encapsulate essential information about the data such as type, shape, memory location, and organization, while also enabling intricate indexing operations.

Triton

Triton is a python-based language and compiler for parallel programming. Phil Tillet, the creator of Triton, explains in this video that the language was designed to address the limitations of GPU programming with respect to CUDA and existing domain-specific languages (DSLs).

While highly effective, CUDA is often a bit too complex to just jump into for researchers and practitioners without specialized GPU programming experience. This complexity not only impedes communication between GPU experts and ML researchers, but also hinders the rapid iteration required to accelerate development in compute-intensive fields.

Additionally, existing DSLs are restrictive. They lack support for custom data structures and control over parallelization strategies and resource allocation.

Triton strikes a balance by allowing its users to define and manipulate tensors in SRAM and modify them with the use of torch-like operators, while still providing the flexibility to implement custom parallelization and resource management strategies. Triton helps democratize GPU programming by making it possible to write efficient GPU code without extensive CUDA experience.

Leveraging the Memory Hierarchy

GPUs possess multiple memory types, with different sizes and speeds. The inverse relationship between memory size and speed is the basis behind the GPU memory hierarchy. Strategically allocating variables to different CUDA memory types gives developers more control over their program’s performance. The specified memory type impacts the variable’s scope (confined to a single thread, shared within thread blocks, etc.) and the speed at which it’s accessed. Variables stored in high-speed memory like registers or shared memory can be more quickly retrieved than variables stored in slower memory types such as global memory.

FlashAttention is an example of a hardware-aware algorithm that exploits the memory hierarchy.

What does GPU Performance even mean?

Performance evaluation in GPU computing depends on the intended use-case. That being said, key metrics used to assess overall efficiency include latency and throughput.

Latency refers to the time delay between request and response. In the context of our favourite parallel processor, request is when the GPU receives a command for processing and response is when processing is complete and the result is returned.

Throughput is the number of units the GPU processes per second. This metric reflects the GPU’s processing capacity to handle multiple tasks in parallel. GPU architects and developers strive to minimize latency and maximize throughput.

These metrics are often looked at when benchmarking GPUs. For instance, the study, Benchmarking and Dissecting the Nvidia Hopper GPU Architecture, benchmarks Hopper GPUs with latency and throughput tests for different memory units, Tensor Cores, and new CUDA programming features introduced with Hopper (DPX, asynchronous data movement, and distributed shared memory).

A100 GPU memory bandwidth vs. FLOPS

From Stephen Jones’ 2022 GTC talk How CUDA Programming Works: Floating Point Operations Per Second (FLOPS) are often cited as a performance measure, but they’re rarely the limiting factor. GPUs typically have an abundance of floating-point computational power and therefore other aspects like memory bandwidth prove to be more significant bottlenecks.

Performance Monitoring Tools

GPU performance monitoring allows developers and system administrators to identify bottlenecks (is the job memory-bound, latency-bound, or compute-bound?), effectively allocate GPU resources, prevent overheating, manage power consumption, and make informed decisions about hardware upgrades. NVIDIA provides two powerful tools for GPU monitoring: Nsight and SMI.

NVIDIA’s Nsight

The NVIDIA Nsight Systems is a system-wide performance analysis tool that allows visualization of an application’s algorithm and identification of areas for optimization. Additional information on NVIDIA NSight Compute can be found in the kernel profiling guide.

NVIDIA’s System Management Interface

The NVIDIA System Management Interface (nvidia-smi) is a command-line tool, built on the NVIDIA Management Library, for managing and monitoring GPU devices. Additional information can be found in the nvidia-smi documentation.

Example nvidia-smi output:

Conclusion

This article is by no means conclusive to all there is to GPU optimization, but rather an introduction to the topic. It is encouraged to explore the links sprinkled throughout the article and in the references section to improve your understanding. More articles to come!

References

NVIDIA Blogs/Documentation

Papers

Some Excellent Resources

Programming Massively Parallel Processors (4th edition)
GPU Mode youtube channel and discord server (previously called CUDA Mode)
Blog Post by Tim Dettmers: Which GPU(s) to Get for Deep Learning: My Experience and Advice for Using GPUs in Deep Learning
THE TRITON LANGUAGE | PHILIPPE TILLET
GPU optimization shared note
CUTLASS Tutorial:Mastering the NVIDIA® Tensor Memory Accelerator (TMA)
How CUDA Programming Works
FreeCodeCamp course by Elliot Arledge:CUDA Programming Course – High-Performance Computing with GPUs

Thanks for learning with the DigitalOcean Community. Check out our offerings for compute, storage, networking, and managed databases.

Learn more about our products

About the author

Melani Maheswaran

Author

See author profile

Melani is a Technical Writer at DigitalOcean based in Toronto. She has experience in teaching, data quality, consulting, and writing. Melani graduated with a BSc and Master’s from Queen's University.

See author profile

Category:

Tutorial

Tags:

AI/ML