
The Hidden Bottleneck: How GPU Memory Hierarchy Affects Your Computing Experience

Updated on November 8, 2024

Introduction

The GPU memory hierarchy is increasingly becoming an area of interest for deep learning researchers and practitioners alike. By building an intuition around memory hierarchy, developers can minimize memory access latency, maximize memory bandwidth, and reduce power consumption leading to shorter processing times, accelerated data transfer, and cost-effective compute usage. A thorough understanding of memory architecture will enable developers to achieve peak GPU capabilities at scale.

CUDA Refresher

CUDA (Compute Unified Device Architecture) is a parallel computing platform and programming model developed by NVIDIA for general-purpose computing on its GPUs.

The execution of a CUDA program begins when the host code (CPU serial code) calls a kernel function. This function call launches a grid of threads on a device (GPU) to process different data components in parallel.

A thread comprises the program’s code, the current execution point in that code, and the values of its variables and data structures. A group of threads forms a thread block, and a group of thread blocks composes the CUDA kernel grid. These software constructs map onto the hardware: threads execute on CUDA cores, and thread blocks are scheduled onto CUDA Streaming Multiprocessors (SMs).

Together, these make up the constituent parts of the GPU.
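To make the host/device split concrete, here is a minimal sketch of a kernel launch: the host (serial) code launches a grid of thread blocks, and each thread processes one element of an array. The kernel name, array size, and launch dimensions are illustrative assumptions, not taken from the article.

```
#include <cuda_runtime.h>

// Kernel: each thread in the grid scales one element of the array.
__global__ void scaleKernel(float* data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) {
        data[i] *= factor;
    }
}

int main() {
    const int n = 1 << 20;
    float* d_data = nullptr;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    // Host code launches a grid of thread blocks on the device.
    int threadsPerBlock = 256;
    int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
    scaleKernel<<<blocksPerGrid, threadsPerBlock>>>(d_data, 2.0f, n);

    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}
```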


Threads are organized into blocks and blocks are organized into grids. Figure taken from the NVIDIA Technical Blog.


Figure taken from NVIDIA H100 White Paper.

The H100 introduces a new Thread Block Cluster architecture, extending the CUDA programming hierarchy to include Threads, Thread Blocks, Thread Block Clusters, and Grids.

CUDA Memory Types

The memory types used by a CUDA device differ in accessibility, lifetime, and speed. When a CUDA programmer assigns a variable to a specific CUDA memory type, they dictate how the variable is accessed, the speed at which it’s accessed, and the extent of its visibility.

Here’s a quick overview of the different memory types:


Figure taken from Chapter 5 of the 4th edition of the textbook, Programming Massively Parallel Processors.

Register memory is private to each thread. When the thread finishes, the data held in its registers is lost.

Local memory is also private to each thread, but it physically resides in off-chip device memory, which makes it much slower than register memory.

Shared memory is accessible to all threads in the same block and lasts for the block’s lifetime.

Global memory holds data whose lifetime spans the entire grid and is managed by the host. All threads, as well as the host, have access to global memory.

Constant memory is read-only and designed for data that does not change for the duration of the kernel’s execution.

Texture memory is another read-only memory type, optimized for access patterns with spatial locality. Using it can reduce memory traffic and improve performance compared to plain global memory accesses.
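As a rough sketch of how these memory types appear in CUDA C++ code, the example below declares variables in register (automatic), shared, constant, and global memory. The kernel and variable names are hypothetical and only meant to illustrate the qualifiers; the kernel is assumed to be launched with TILE threads per block.

```
#include <cuda_runtime.h>

constexpr int TILE = 256;

// Constant memory: read-only during kernel execution; the host fills it in
// beforehand, e.g. with cudaMemcpyToSymbol(coeff, &value, sizeof(float)).
__constant__ float coeff;

// The pointer arguments refer to global memory allocated by the host with
// cudaMalloc; global memory is visible to every thread and to the host.
__global__ void memoryTypesDemo(const float* in, float* out, int n) {
    // Shared memory: visible to all threads in this block and lives for the
    // lifetime of the block.
    __shared__ float tile[TILE];

    // Automatic scalar variables such as this index are normally placed in
    // registers, private to each thread.
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;   // global -> shared
    __syncthreads();                              // every thread in the block reaches this point

    if (i < n) {
        out[i] = tile[threadIdx.x] * coeff;       // shared + constant -> global
    }
}
```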

GPU Memory Hierarchy

The Speed-Capacity Tradeoff

It is important to understand that with respect to memory access efficiency, there is a tradeoff between bandwidth and memory capacity. Higher speed is correlated with lower capacity.

Registers

Registers are the fastest memory components on a GPU, making up the register file that feeds data directly into the CUDA cores. A kernel function uses registers to hold variables that are private to each thread and accessed frequently.

Both registers and shared memory are on-chip memories; variables residing in them can be accessed at very high speed and in a highly parallel manner.

By leveraging registers effectively, data reuse can be maximized and performance optimized, as in the sketch below.
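The following hypothetical kernel keeps a running sum in an automatic variable, which the compiler will normally place in a register, so each thread touches global memory only once for its output instead of once per loop iteration.

```
// Each thread reduces one row of a row-major matrix; the running sum stays
// in a register instead of being written back to global memory each step.
__global__ void rowSums(const float* in, float* out, int width, int rows) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < rows) {
        float sum = 0.0f;                    // held in a register
        for (int c = 0; c < width; ++c) {
            sum += in[row * width + c];      // the register is reused every iteration
        }
        out[row] = sum;                      // a single global-memory write per thread
    }
}
```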

Cache Levels

Modern processors have multiple levels of caches, numbered by their distance from the processor core: lower numbers are closer and faster.

L1 Cache

L1 or level 1 cache is attached directly to the processor core. It functions as a backup storage area when the amount of active data exceeds the capacity of an SM’s register file.

L2 Cache

L2 or level 2 cache is larger than L1 and is shared across SMs. Unlike the per-SM L1 caches, there is only one L2 cache for the entire GPU.

Constant Cache

The constant cache holds frequently accessed constant-memory variables for each kernel, leading to improved performance.

Because constant memory variables are never modified during a kernel’s execution, the hardware does not need to support writing them back. A specialized read-only structure like the constant cache can therefore serve these variables to many threads at once without the costly hardware logic a writable cache would require.
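The usual pattern, sketched below with hypothetical names, is for the host to write a constant-memory symbol once with cudaMemcpyToSymbol before launching the kernel; device code then reads it through the constant cache.

```
#include <cuda_runtime.h>

// A small read-only table in constant memory.
__constant__ float filterWeights[16];

__global__ void applyFilter(const float* in, float* out, int n) {
    // Every thread in the block reads the same constant address, so the value
    // is served from the constant cache and broadcast efficiently.
    float w = filterWeights[blockIdx.x % 16];

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = in[i] * w;
    }
}

void uploadWeights(const float* hostWeights) {
    // The host writes the constant symbol once, before any kernel launch;
    // device code never modifies it.
    cudaMemcpyToSymbol(filterWeights, hostWeights, 16 * sizeof(float));
}
```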

New Memory Features with H100s


NVIDIA Hopper Streaming Multiprocessor. Figure taken from NVIDIA H100 White Paper.

The Hopper architecture, through its H100 line of GPUs, introduces new features that improve performance over previous NVIDIA microarchitectures.

Thread Block Clusters

As mentioned earlier in the article, Thread Block Clusters debuted with the H100, expanding the CUDA programming hierarchy. A Thread Block Cluster allows programmatic control over a larger group of threads than a single Thread Block on one SM can provide.
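As a rough sketch, a cluster can be requested at compile time with the __cluster_dims__ attribute, and the blocks within it can be synchronized through the cooperative groups cluster API; this requires CUDA 12 or later and a compute capability 9.0 GPU such as the H100. The kernel body and launch sizes here are illustrative assumptions.

```
#include <cooperative_groups.h>
#include <cuda_runtime.h>
namespace cg = cooperative_groups;

// Compile-time cluster shape: 2 thread blocks per cluster (requires CC 9.0+).
__global__ void __cluster_dims__(2, 1, 1) clusterKernel(float* data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] = 1.0f;  // each block writes its portion of the data

    // Synchronize all thread blocks in this cluster, even if they are
    // resident on different SMs.
    cg::cluster_group cluster = cg::this_cluster();
    cluster.sync();
}

int main() {
    const int blocks = 8, threads = 256;  // grid size must be a multiple of the cluster size
    float* d_data = nullptr;
    cudaMalloc(&d_data, blocks * threads * sizeof(float));

    clusterKernel<<<blocks, threads>>>(d_data);
    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}
```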

Asynchronous Execution

The latest advancements in asynchronous execution introduce a Tensor Memory Accelerator (TMA) and an Asynchronous Transaction Barrier into the Hopper architecture.

The Tensor Memory Accelerator (TMA) unit allows for the efficient data transfer of large blocks between global and shared memory.

The Asynchronous Transaction Barrier allows for synchronization of CUDA threads and on-chip accelerators, regardless of whether they are physically located on separate SMs.
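The sketch below shows the asynchronous-copy pattern these features build on: a barrier-tracked cuda::memcpy_async from global to shared memory, where completion is signaled through a cuda::barrier rather than by stalling threads on individual loads. This pattern is available from Ampere onward; on Hopper, bulk copies of this kind are the transfers the TMA is designed to accelerate. Names and sizes are illustrative, and the grid is assumed to exactly cover the input with one tile per block.

```
#include <cooperative_groups.h>
#include <cuda/barrier>
namespace cg = cooperative_groups;

constexpr int TILE = 256;

// Copies one TILE-sized chunk per block from global to shared memory
// asynchronously, then processes it. Launched with TILE threads per block.
__global__ void asyncTileKernel(const float* in, float* out) {
    __shared__ float tile[TILE];
    __shared__ cuda::barrier<cuda::thread_scope_block> bar;

    auto block = cg::this_thread_block();
    if (block.thread_rank() == 0) {
        init(&bar, block.size());   // one thread sets the expected arrival count
    }
    block.sync();

    int base = blockIdx.x * TILE;

    // Start a bulk asynchronous copy; completion is tracked by the barrier.
    cuda::memcpy_async(block, tile, in + base, sizeof(float) * TILE, bar);

    bar.arrive_and_wait();          // wait until the tile has landed in shared memory

    out[base + block.thread_rank()] = tile[block.thread_rank()] * 2.0f;
}
```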


H100s contain both the Asynchronous Barriers introduced within the Ampere GPU architecture and the new Asynchronous Transaction Barriers.

Conclusion

Assigning variables to specific CUDA memory types gives a programmer precise control over each variable’s behaviour. This designation determines not only how the variable is accessed, but also the speed at which that access occurs. Variables stored in memory types with faster access times, such as registers or shared memory, can be retrieved quickly, accelerating computation. In contrast, variables in slower memory types, such as global memory, are accessed at a slower rate. Memory type assignment also governs the variable’s scope and its interaction with other threads: whether it is visible to a single thread, a block of threads, or all threads within a grid. Finally, the H100, currently a state-of-the-art GPU for AI workloads, introduced several new features that influence memory access, such as Thread Block Clusters, the Tensor Memory Accelerator (TMA) unit, and Asynchronous Transaction Barriers.

References

Programming Massively Parallel Processors: A Hands-on Approach (4th edition)

NVIDIA H100 Tensor Core GPU Architecture Whitepaper (Hopper)

CUDA Refresher: The CUDA Programming Model | NVIDIA Technical Blog
