GPUs are described as parallel processors for their ability to execute work in parallel. Tasks are divided into smaller sub-tasks, executed simultaneously by multiple processing units, and combined to produce the final result. These processing units (threads, warps, thread blocks, cores, multiprocessors) share resources, such as memory, facilitating collaboration between them and enhancing overall GPU efficiency.
One unit in particular, warps, are a cornerstone of parallel processing. By grouping threads together into a single execution unit, warps allow for the simplification of thread management, the sharing of data and resources among threads, as well as the masking of memory latency with effective scheduling.
“The term warp originates from weaving, the first parallel-thread technology”
It may be helpful to read this “CUDA refresher” before proceeding**
In this article, we will outline how warps are useful for optimizing the performance of GPU-accelerated applications. By building an intuition around warps, developers can achieve significant gains in computational speed and efficiency.
Thread blocks are partitioned into warps comprised of 32 threads each. All threads in a warp run on the same Streaming Multiprocessor. Figure from an NVIDIA presentation on GPGPU AND ACCELERATOR TRENDS
When a Streaming Multiprocessor (SM) is assigned thread blocks for execution, it subdivides the threads into warps. Modern GPU architectures typically have a warp size of 32 threads.
The number of warps in a thread block depends on the thread block size configured by the CUDA programmer. For example, if the thread block size is 96 threads and the warp size is 32 threads, the number of warps per thread block would be: 96 threads/ 32 threads per warp = 3 warps per thread block.
In this figure, 3 thread blocks are assigned to the SM. The thread blocks are comprised of 3 warps each. A warp contains 32 consecutive threads. Figure from Medium article
Note how, in the figure, the threads are indexed, starting at 0 and continuing between the warps in the thread block. The first warp is made of the first 32 threads (0-31), the subsequent warp has the next 32 threads (32-63), and so forth.
Now that we’ve defined warps, let’s take a step back and look at Flynn’s Taxonomy, focusing on how this categorization scheme applies to GPUs and warp-level thread management.
Flynn’s Taxonomy is a classification system based on a computer architecture’s number of instructions and data streams. There are 4 classes: SISD (Single Instruction Single Data) , SIMD (Single Instruction Multiple Data), MISD (Multiple Instruction Single Data), MIMD (Multiple Instruction Multiple Data). Figure taken from CERN’s PEP root6 workshop
Flynn’s Taxonomy is a classification system based on a computer architecture’s number of instructions and data streams. GPUs are often described as Single Instruction Multiple Data (SIMD), meaning they simultaneously perform the same operation on multiple data operands. Single Instruction Multiple Thread (SIMT), a term coined by NVIDIA, extends upon Flynn’s Taxonomy to better describe the thread-level parallelism NVIDIA GPUs exhibit. In an SIMT architecture, multiple threads issue the same instructions to data. The combined effort of the CUDA compiler and GPU allow for threads of a warp to synchronize and execute identical instructions in unison as frequently as possible, optimizing performance.
While both SIMD and SIMT exploit data-level parallelism, they are differentiated in their approach. SIMD excels at uniform data processing, whereas SIMT offers increased flexibility as a result of its dynamic thread management and conditional execution.
In the context of warps, latency is the number of clock cycles for a warp to finish executing an instruction and become available to process the next one.
W denotes warp and T denotes thread. GPUs leverage warp scheduling to hide latency whereas CPUs execute sequentially with context switching. Figure from Lecture 6 of CalTech’s CS179
Maximum utilization is attained when all warp schedulers always have instructions to issue at every clock cycle. Thus, the number of resident warps, warps that are being executed on the SM at a given moment, directly affect utilization. In other words, there needs to be warps for warp schedulers to issue instructions to. Multiple resident warps enable the SM to switch between them, hiding latency and maximizing throughput.
Program counters increment each instruction cycle to retrieve the program sequence from memory, guiding the flow of the program’s execution. Notably, while threads in a warp share a common starting program address, they maintain separate program counters, allowing for autonomous execution and branching of the individual threads.
Pre-Volta GPUs had a single program counter for a 32 thread warp. Following the introduction of the Volta micro-architecture, each thread has its own program counter. As Stephen Jones puts it during his GTC’ 17 talk : "so now all these threads are wholly independent- they still work better if you gang them together…but you’re no longer dead in the water if you split them up."Figure from Inside Volta GPUs (GTC’17).
Separate program counters allow for branching, an if-then-else programming structure, where instructions are processed only if threads are active. Since optimal performance is attained when a warp’s 32 threads converge on one instruction, it is advised for programmers to write code that minimizes instances where threads within a warp take a divergent path.
Warps play an important role in GPU programming. This 32-thread unit leverages SIMT to increase the efficiency of parallel processing. Effective warp scheduling hides latency and maximizes throughput, allowing for the streamlined execution of complex workloads. Additionally, program counters and branching facilitate flexible thread management. Despite this flexibility, programmers are advised to avoid long sequences of diverged execution for threads in the same warp.
Thanks for learning with the DigitalOcean Community. Check out our offerings for compute, storage, networking, and managed databases.
This textbox defaults to using Markdown to format your answer.
You can type !ref in this text area to quickly search our full set of tutorials, documentation & marketplace offerings and insert the link!