
Introduction to NVIDIA CUDA: Achieving Peak Performance with H100 for AI and Deep Learning

Published on September 30, 2024

Introduction

In this article, we will introduce NVIDIA CUDA for parallel computing. CUDA (Compute Unified Device Architecture) is NVIDIA’s parallel computing platform: the software layer that lets developers use the power of GPUs for general-purpose tasks. CUDA serves as the bridge between NVIDIA GPUs and GPU-based applications, enabling popular deep learning libraries like TensorFlow and PyTorch to leverage GPU acceleration. This capability is crucial for optimizing deep learning workloads and powering GPU-accelerated applications.

Prerequisites

  • Understanding Parallel Computing: Know basic concepts like threads, parallelism, and task distribution.
  • Basic GPU Knowledge: Understanding what GPUs are and their general role in computing.
  • Linear Algebra Basics: Familiarity with matrices and vector operations, as CUDA often involves matrix computations.
  • Installed CUDA Toolkit: Ensure that the CUDA toolkit and compatible NVIDIA GPU drivers are installed on your system.

A Brief History of NVIDIA CUDA

Compute Unified Device Architecture, better known as CUDA, is a parallel computing platform developed by NVIDIA and first released on 23 June 2007. NVIDIA CUDA revolutionized how GPUs are used for general-purpose computing (GPGPU).

CUDA started as NVIDIA’s solution to utilizing the potential of graphics processing units (GPUs) for general-purpose computing. Before CUDA, GPUs were primarily used for graphics rendering in games and visual applications. However, researchers recognized that GPUs could perform many calculations in parallel, making them suitable for other computational tasks beyond graphics, such as scientific simulations and data processing.

Ian Buck is a key figure in the development of CUDA. Before joining NVIDIA, Ian Buck worked on the Brook programming language at Stanford University, one of the first tools to enable general-purpose computing on GPUs. His research and insights from Brook helped lay the groundwork for what would become CUDA.

He joined NVIDIA in the early 2000s and played a central role in transforming GPUs from graphics-focused hardware into powerful tools for general-purpose computing (GPGPU). Buck led the team that developed CUDA at NVIDIA. He worked on creating a platform where developers could use common programming languages like C to access GPU computing power for tasks beyond graphics. This marked a significant leap from earlier, more complex methods of GPGPU programming, such as shading languages. In 2007, Buck’s efforts led to the release of CUDA, NVIDIA’s revolutionary parallel computing platform. NVIDIA CUDA made it possible to use GPUs for various applications, including scientific research, engineering simulations, and, eventually, AI and deep learning. By around 2015, CUDA development had shifted its focus towards neural networks and AI.

Why do we need CUDA?

A CPU (central processing unit) is the main component in a computer responsible for executing code, performing tasks like file management, data processing, and handling user input. While a CPU can multitask, each core can only work on one task at a time. Consumer CPUs typically have a handful of cores (often between 4 and 16), which is more than enough for everyday tasks, and they’re so fast that we hardly notice tasks being handled in sequence.

On the other hand, a GPU (graphics processing unit) is specialized for handling parallel computations, making it more powerful for tasks like graphics rendering and, more recently, for things like AI and scientific computations. While both CPUs and GPUs are crucial hardware, they’re optimized for different tasks.

Let us look at the numbers: a top consumer CPU may have 16 cores, an NVIDIA RTX 4090 GPU has 16,384 CUDA cores, and an H100 (PCIe) has 14,592. So imagine the amount of workload that a GPU can handle.

Then the question arises: why do we need a CPU at all?

While GPUs are known for parallel computing and multitasking, we still need CPUs because they’re better suited for simpler, sequential tasks where parallelism doesn’t help. While the GPU has many smaller processors, the fewer, more complex cores of a CPU are better at executing tasks in sequence. The real advantage, though, comes from using both together: CPUs handle general computing, while GPUs tackle heavy parallel workloads. This is where CUDA comes in. CUDA lets developers move work between the two efficiently, maximizing performance by using the best tool for each job.

CUDA lets programmers harness the power of thousands of GPU cores to write parallel algorithms, which is key for tasks like machine learning, video editing, scientific research, and data processing. By providing a programming model and APIs, CUDA allows developers to run code directly on the GPU, boosting performance over traditional CPU-based methods. Offloading intensive workloads to the GPU through CUDA helps drive advancements in high-performance computing.

[Figure: Growth in NVIDIA GPU vs. Intel CPU processing power]

This graph shows how the processing power of NVIDIA and Intel chips has grown over the past decade, measured in billions of calculations per second.

Understanding CUDA Architecture

CUDA (Compute Unified Device Architecture) is NVIDIA’s platform for parallel computing. It allows developers to use GPUs’ power for general-purpose tasks, not just graphics rendering.

In CUDA, three types of functions are used to support parallel programming on the CPU and GPU, each with specific roles. Here’s a breakdown:

1. Host Functions (CPU-only)

  • What they do: These are regular functions that run only on the CPU, not the GPU.
  • How they work: They are like regular C functions and are responsible for setting up and managing GPU operations. You don’t need any special qualifier for them.

2. Kernel Functions (CPU calls, GPU executes)

  • What they do: Kernel functions are special functions written to be executed on the GPU but called from the CPU.
  • Special feature: You must use the qualifier __global__ when defining these functions. They always return void (i.e., no return values).
  • How it works: The CPU calls these kernel functions, but the actual execution happens on the GPU. When calling a kernel function, you specify how many blocks and threads you want to use for parallel processing.

3. Device Functions (GPU-only)

  • What they do: These functions run on and are called by the GPU itself. They help the GPU perform tasks more efficiently by breaking them into smaller parts.
  • Special feature: These functions use the qualifier __device__, and unlike kernel functions, they can return values of any type (not just void).

Putting it together (a short code sketch follows this list):

  • Host functions: Only run on the CPU and are used to set up and manage GPU tasks.
  • Kernel functions: These are called by the CPU but run on the GPU, handling the heavy parallel computation.
  • Device functions: Called and executed by the GPU, allowing for more flexible, GPU-side computations.
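
To make these three roles concrete, here is a minimal CUDA C++ sketch. The names (square, squareKernel) and sizes are purely illustrative, not from any particular library:

```cpp
// Device function: runs on the GPU and can only be called from GPU code.
// Unlike a kernel, it may return a value.
__device__ float square(float x) {
    return x * x;
}

// Kernel function: defined with __global__, called from the CPU,
// executed on the GPU, and always returns void.
__global__ void squareKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] = square(data[i]);   // device function called from GPU code
    }
}

// Host function: ordinary C/C++ code that runs on the CPU and manages the GPU.
int main() {
    const int n = 256;
    float *d_data;
    cudaMalloc((void**)&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    // The CPU launches the kernel with 1 block of 256 threads; the GPU runs it.
    squareKernel<<<1, n>>>(d_data, n);
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}
```

The host code manages memory and launches the kernel, the kernel runs once per thread on the GPU, and the device function is a helper that only GPU code can call.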

CUDA Architecture is designed to maximize the graphics card’s computational power by organizing tasks into a hierarchical structure of grids, blocks, and threads. Each block contains multiple threads, and multiple blocks make up a grid. This setup allows for a high level of parallelism, enabling efficient use of the GPU’s resources.

[Figure: The CUDA grid, block, and thread hierarchy]

The Grid

A grid in CUDA is a group of threads running the same kernel, but they are not synchronized. Each CUDA kernel call from the CPU involves one grid. While multiple grids can run simultaneously, grids can’t be shared between GPUs in multi-GPU systems; each GPU uses its own grids for maximum efficiency.

The Block

Grids in CUDA are made up of blocks, each containing multiple threads and its own shared memory. A block always runs on a single multiprocessor and is never split across multiprocessors. All the blocks in a grid run the same program, and the built-in “blockIdx” variable identifies the current block. Block IDs can be 1D, 2D, or 3D, depending on the grid’s dimensions. A grid can hold up to 65,535 blocks in each of its y and z dimensions (and far more in its x dimension on modern GPUs).

The Thread

Blocks consist of threads, which run on the cores of a multiprocessor. Unlike threads in different blocks, threads within the same block can synchronize with each other and communicate through shared memory. Each thread has a unique ID called “threadIdx,” which can be 1D, 2D, or 3D, depending on the block’s dimensions; the thread ID is specific to the block it belongs to. Threads also have access to a certain amount of register memory, and a block can contain up to 1,024 threads. These threads are the basic unit of parallel execution on a GPU, much like individual workers, each handling a small part of the task.

Because every thread has its own unique ID, it can work on a specific portion of the data. This structure allows the GPU to break a task into thousands of threads and execute them simultaneously across many blocks running in parallel.

Usually, a larger number of threads gives better performance.
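
Thread and block IDs are what let each thread find its own piece of the data. Below is a minimal sketch using 2D IDs to scale a matrix; the kernel name scaleMatrix and the 16 x 16 block size are illustrative choices, not requirements:

```cpp
// One thread per matrix element: 2D block and thread IDs map to (row, col).
__global__ void scaleMatrix(float *m, int width, int height, float factor) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < height && col < width) {          // guard against extra threads
        m[row * width + col] *= factor;
    }
}

// Host-side launch: 16 x 16 = 256 threads per block, and enough blocks
// in each dimension to cover the whole matrix, e.g.:
// dim3 threads(16, 16);
// dim3 blocks((width + 15) / 16, (height + 15) / 16);
// scaleMatrix<<<blocks, threads>>>(d_matrix, width, height, 2.0f);
```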

Key Components of CUDA Architecture:

  • Multiprocessors: The GPU has multiple streaming multiprocessors (SMs), each containing many CUDA cores that run the threads.
  • Kernel: All of the threads will execute the same function (code), known as a kernel.
  • CUDA Cores: These are the actual processors that execute the code. Generally, the more cores, the better.
  • Warp: A group of 32 threads executed together in SIMT (Single Instruction, Multiple Threads) fashion. This is the basic scheduling unit in CUDA (see the sketch after this list).
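
As a small illustration of SIMT execution at the warp level, the sketch below sums 32 values within each warp using the __shfl_down_sync intrinsic (available since CUDA 9). The kernel name is illustrative, and the launch is assumed to use a multiple of 32 threads per block:

```cpp
__global__ void warpSum(const float *in, float *out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float val = in[i];

    // Tree reduction inside the warp: each step pulls a value from a lane
    // 'offset' positions away, halving the number of partial sums.
    for (int offset = 16; offset > 0; offset /= 2) {
        val += __shfl_down_sync(0xffffffff, val, offset);
    }

    // Lane 0 of every warp now holds the sum of its 32 inputs.
    if ((threadIdx.x & 31) == 0) {
        out[i / 32] = val;
    }
}
```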

In CUDA, the parts of a program that can be parallelized need to be broken down into many threads that can run simultaneously. These threads are created using special functions called kernels, which are simply functions designed to run on the GPU. The kernel is executed or launched as a set of threads.

In the CUDA programming model, the CPU (host) and GPU (device) each have their own memory. The CPU excels at running serial tasks but struggles with massively parallel operations, which is where the GPU comes into play. GPUs handle these parallel tasks, and the CPU offloads them to the GPU. This approach, known as heterogeneous parallel programming, allows the host and device to work together.

CUDA, designed for NVIDIA GPUs, manages both the host and the device. The CPU controls most of the program, but whenever a section can be parallelized, it sends that work to the GPU. Data between the CPU and GPU is transferred via the PCI Express bus, which is relatively slow, so only highly parallel tasks are offloaded to the GPU to maximize efficiency.
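
In code, that division of labor looks like the sketch below: the host allocates device memory, copies input across the PCI Express bus, launches a kernel, and copies the result back. Names such as h_data and d_data are just the usual host/device naming convention:

```cpp
#include <cstdio>
#include <vector>

__global__ void doubleValues(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    std::vector<float> h_data(n, 1.0f);            // host (CPU) memory

    float *d_data;
    cudaMalloc((void**)&d_data, n * sizeof(float)); // device (GPU) memory
    cudaMemcpy(d_data, h_data.data(), n * sizeof(float),
               cudaMemcpyHostToDevice);             // CPU -> GPU over PCIe

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    doubleValues<<<blocks, threads>>>(d_data, n);   // parallel work on the GPU

    cudaMemcpy(h_data.data(), d_data, n * sizeof(float),
               cudaMemcpyDeviceToHost);             // GPU -> CPU
    cudaFree(d_data);

    printf("h_data[0] = %f\n", h_data[0]);          // prints 2.0
    return 0;
}
```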

[Figure: A GPU organized as an array of streaming multiprocessors]

As the figure above illustrates, a CUDA-capable GPU consists of an array of streaming multiprocessors that support a high degree of threading. The number of streaming multiprocessors (SMs) in each GPU varies between generations. Each SM contains several streaming processors (SPs) that share control logic and an instruction cache.

Memory Hierarchy

CUDA organizes memory into several levels, illustrated in the sketch after this list:

  • Global Memory: The main memory of the GPU. It is readable and writable by all threads, but it is slower than the other types of memory.
  • Shared Memory: Each block of threads has its own shared memory, which is faster and is used for data that the threads within a block need to share.
  • Registers: Each thread has its own small set of registers, used for storing data specific to that thread.
  • Constant Memory: In CUDA, constants and kernel arguments are stored in constant memory, a special read-only region of GPU memory. Accessing constant memory can be slower than accessing registers or shared memory, but it is cached, which speeds up access when multiple threads read the same data.
  • Texture Memory: Read-only memory whose cache is optimized for 2D spatial access patterns.
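
The sketch below shows where each level appears in code, assuming a launch with 256 threads per block: a __constant__ scale factor, a __shared__ staging buffer per block, local variables that the compiler keeps in registers, and global memory accessed through the in and out pointers. The kernel and its names are illustrative:

```cpp
__constant__ float scale;            // constant memory: read-only, cached

__global__ void blockSum(const float *in, float *out, int n) {
    __shared__ float tile[256];      // shared memory: one slot per thread,
                                     // visible to this block only

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = (i < n) ? in[i] * scale : 0.0f;   // 'i' and 'v' live in registers
    tile[threadIdx.x] = v;           // stage the value in shared memory
    __syncthreads();

    // Thread 0 of each block adds up its tile and writes one value back
    // to global memory, which is visible to all threads.
    if (threadIdx.x == 0) {
        float sum = 0.0f;
        for (int t = 0; t < blockDim.x; ++t) sum += tile[t];
        out[blockIdx.x] = sum;
    }
}

// Host side: the constant is set with cudaMemcpyToSymbol, e.g.
// float s = 0.5f;
// cudaMemcpyToSymbol(scale, &s, sizeof(float));
```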

Why Do We Use GPUs for Deep Learning?

We primarily want to use GPUs because of their sheer computational power. GPUs have two main advantages over CPUs: they can perform a huge number of calculations at once, and they have very high memory bandwidth, allowing them to move large amounts of data quickly. Deep learning uses GPUs because they excel at the large-scale, parallel computations required by neural networks. Here’s why:

Parallel Processing
Deep learning models, especially deep neural networks, involve thousands or even millions of matrix operations, like multiplying and adding large arrays of numbers. GPUs are designed to perform these operations in parallel across thousands of cores, making them much faster for this workload than CPUs, which can only run a handful of tasks at once.
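
As an illustration of how those matrix operations map onto threads, here is a naive matrix multiplication kernel in CUDA C++, where one thread computes one output element. This is purely a sketch; deep learning frameworks actually call heavily optimized libraries such as cuBLAS and cuDNN for this:

```cpp
// C = A * B for square N x N matrices, one thread per output element.
__global__ void matmul(const float *A, const float *B, float *C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N) {
        float acc = 0.0f;
        for (int k = 0; k < N; ++k) {
            acc += A[row * N + k] * B[k * N + col];
        }
        C[row * N + col] = acc;
    }
}

// Launch with a 2D grid so that N x N threads run in parallel, e.g.:
// dim3 threads(16, 16);
// dim3 blocks((N + 15) / 16, (N + 15) / 16);
// matmul<<<blocks, threads>>>(d_A, d_B, d_C, N);
```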

High Throughput
Training deep learning models requires processing vast amounts of data, which is time-consuming on a CPU. GPUs process large batches of data simultaneously, reducing training time from days to hours. For example, training an image recognition model can take significantly longer on a CPU than on a GPU.

Large-Scale Neural Networks
Deep learning models often have numerous layers and parameters (weights) to adjust. For models like Transformer-based architectures (used in language models like GPT), GPUs allow the simultaneous training of large models by distributing the load across many cores.

Examples

  • Image Recognition: In Convolutional Neural Networks (CNNs), a single image might require millions of matrix multiplications to identify patterns. A GPU can handle these computations in parallel, making training faster and more efficient.

  • Natural Language Processing (NLP): In models like GPT or BERT, GPUs accelerate the training of attention mechanisms, which require simultaneous calculations across large data sequences.

The difference in processing power between GPUs and CPUs comes from how they are designed for different tasks:

  • GPUs are built for handling many calculations at once. Most of a GPU’s design is devoted to processing data rather than to storing or managing it.

  • CPUs, on the other hand, are built to quickly handle a few tasks at a time, minimizing delays (latency). They use a large part of their design for managing and storing data, which is why they’re great for running general-purpose programs like operating systems.

In short, while CPUs are designed to reduce delays and handle complex tasks one at a time, GPUs are designed to process large amounts of data simultaneously by focusing more on raw computing power (with lots of cores for calculations).

CUDA Installation

Before starting to use NVIDIA CUDA, you need to install it, which involves several steps. Below are the general steps for installing CUDA on a personal machine.

  1. Check GPU Compatibility: Ensure that your hardware is CUDA-compatible. You can check this on the NVIDIA website.
  2. Install NVIDIA Drivers: Download and install the latest NVIDIA drivers for the GPU. These drivers are essential for enabling CUDA functionality.
  3. Download the CUDA Toolkit: Go to the NVIDIA CUDA Toolkit page and download the version that matches your operating system (Windows or Linux; recent toolkit versions no longer support macOS).
  4. Install the Toolkit: Follow the instructions in the NVIDIA documentation to install the CUDA toolkit. This includes the CUDA libraries, development tools, and necessary headers.
  5. Set Up Environment Variables: After installation, add the CUDA paths (like bin and lib) to your system’s environment variables so the system can locate CUDA executables and libraries.

However, with DigitalOcean GPU droplets, users can skip the hassle of installing CUDA as it comes with CUDA version 12.2 pre-installed.
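
Once the toolkit is available (whether installed manually or pre-installed on a GPU Droplet), a quick way to confirm that the driver and toolkit are working is to query the device from a short program. A minimal sketch, compiled with nvcc (the file name check.cu is arbitrary):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess || count == 0) {
        printf("No CUDA-capable GPU found: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Print basic properties of every visible GPU.
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printf("Device %d: %s, %d SMs, %.1f GB memory, compute capability %d.%d\n",
               i, prop.name, prop.multiProcessorCount,
               prop.totalGlobalMem / 1.0e9, prop.major, prop.minor);
    }
    return 0;
}
```

Running nvidia-smi or nvcc --version from a shell is another quick sanity check that the driver and compiler are visible on your system.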

The H100 GPU: The Next Level of CUDA Performance

[Image: NVIDIA H100 GPU]

DigitalOcean’s GPU Droplets are virtualized GPU instances that can be used to train models using the state-of-the-art NVIDIA H100 GPUs. NVIDIA GPUs are incredibly powerful and perform thousands of parallel computations across numerous cores. The release of the Hopper microarchitecture, led by the NVIDIA H100, has set new benchmarks, surpassing its Ampere predecessors like the A100. Each new architecture brings major improvements in VRAM, CUDA cores, and bandwidth.

The DigitalOcean GPU Droplets come with CUDA 12.2 pre-installed, an advanced release of NVIDIA’s CUDA platform. A few of its key features include:

  • Optimizations for running workloads faster on newer NVIDIA GPUs like the H100 and A100, improving parallel computing efficiency.
  • Enhanced support for C++, Python, and other programming languages, making it easier to develop CUDA applications across different platforms.
  • Better memory management and task scheduling, improving how GPUs handle complex workloads.

Why NVIDIA H100 Tensor Core GPU?

The NVIDIA H100 Tensor Core GPU provides some of the highest memory bandwidth of any commercially available GPU, at over 2 TB/s, along with 80GB of VRAM. This allows the H100 to handle large datasets and models at incredible speeds, making it ideal for large-scale AI applications.

The H100, powered by the Hopper architecture, achieves this remarkable performance through its fourth-generation Tensor Cores alongside 14,592 CUDA cores (in the PCIe variant). This combination delivers an impressive 26 teraFLOPS of full-precision (FP64) performance.

Additionally, the H100 supports various math precisions, enabling it to handle a wide range of computing tasks efficiently. It can manage double precision (FP64), single precision (FP32), half-precision (FP16), and integer (INT8) workloads, making it a versatile choice for any computational need.

Conclusion

NVIDIA’s CUDA is a way to divide complex tasks into many small, parallel tasks that run on a GPU’s thousands of cores. It uses threads, blocks, and grids to manage this parallelism, and the memory hierarchy ensures that the GPU processes data efficiently. This architecture makes CUDA ideal for tasks like deep learning, image processing, and scientific simulations, where speed and parallel computation are critical.
This article introduced CUDA, and we will soon release more articles that provide a deeper understanding of its features, applications, and best practices for optimizing performance in various computing tasks.
