Large Language Models (LLMs) generate coherent, natural language responses, effectively automating a multitude of tasks that were previously exclusive to humans. As many key players in the field, such as Jensen Huang and Ilya Sutskever, have recently alluded to, we’re in an era of agentic AI. This new paradigm seeks to revolutionize various aspects of our lives, from personalized medicine and education to intelligent assistants, and beyond.
However, while these models grow increasingly powerful, widespread adoption is hindered by the massive cost of running them, wait times that render certain real-world applications impractical, and, of course, their growing carbon footprint. To reap the benefits of this technology while mitigating cost and power consumption, it is critical that we continue to optimize every aspect of LLM inference.
The goal of this article is to give readers an overview of current ways in which researchers and deep learning practitioners are optimizing LLM inference.
Just as a person applies what they have learned to solve a new problem, inference is when a trained AI model uses the patterns detected during training to make predictions on new data. This inference process is what enables LLMs to perform tasks like text completion, translation, summarization, and conversation.
DigitalOcean has partnered with HuggingFace to offer 1-click models. This allows for the integration of GPU Droplets with state-of-the-art open-source LLMs in Text Generation Inference (TGI)-optimized container applications. This means many of the inference optimizations covered in this article (e.g., tensor parallelism, quantization, FlashAttention, PagedAttention) are already taken care of and maintained by HuggingFace. For information on how to use these 1-click models, check out our article Getting Started with LLMs.
While this article includes some introductory deep learning concepts, many topics discussed are relatively advanced. Those determined to better understand inference optimization are encouraged to explore the links scattered throughout the article and in the references section.
It is advised that readers have an understanding of neural network fundamentals, the attention mechanism, the transformer, and data types before proceeding.
It would also help to be knowledgeable about the GPU memory hierarchy.
The article Introduction to GPU Performance Optimization provides context on how GPUs can be programmed to accelerate neural network training and inference. It also explains key terms such as latency and throughput.
LLM inference can be divided into two phases: prefill and decode. These phases are treated separately because of their different computational requirements: prefill, a highly parallelized matrix-matrix operation that saturates GPU utilization, is compute-bound, while decode, a matrix-vector operation that underutilizes the GPU's compute capability, is memory-bound.
The prefill phase can be likened to reading an entire document at once and processing all of its words simultaneously to write the first word of a response, whereas the decode phase can be compared to continuing to write that response word by word, where the choice of each word depends on what was written before.
Let’s explore why prefill is compute-bound and decode is memory-bound.
In the prefill stage, the LLM processes the entire input prompt at once to generate the first response token. This involves performing a full forward pass through the transformer layers for every token in the prompt simultaneously. While memory access is needed during prefill, the computational work of processing the tokens in parallel dominates the performance profile.
In the decode stage, text is generated autoregressively: the next token is predicted one at a time given all previous tokens. The decoding process is memory-bound due to its need to repeatedly access historical context. For each new token generated, the model must load the attention cache (key/value states, AKA the KV cache) from all previous tokens, requiring frequent memory accesses that become more intensive as the sequence grows longer.
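To make the two phases concrete, here is a minimal NumPy sketch of a single attention layer's cache behavior (toy dimensions, random weights, no output projection, all names illustrative): prefill projects the whole prompt with one matrix-matrix multiply, while each decode step projects a single token and must re-read the growing KV cache.

```python
import numpy as np

d_model, prompt_len = 64, 16
W_qkv = np.random.randn(d_model, 3 * d_model)     # stand-in for one layer's QKV projection

# Prefill: all prompt tokens are projected at once (matrix-matrix multiply, compute-bound).
prompt_hidden = np.random.randn(prompt_len, d_model)
qkv = prompt_hidden @ W_qkv                        # (prompt_len, 3 * d_model)
k_cache, v_cache = qkv[:, d_model:2 * d_model], qkv[:, 2 * d_model:]

# Decode: each step projects a single token (matrix-vector multiply, memory-bound)
# and must re-read the ever-growing KV cache.
for step in range(4):
    new_hidden = np.random.randn(1, d_model)       # hidden state of the last generated token
    q, k, v = np.split(new_hidden @ W_qkv, 3, axis=-1)
    k_cache = np.vstack([k_cache, k])              # the cache grows by one entry per step
    v_cache = np.vstack([v_cache, v])
    scores = (q @ k_cache.T) / np.sqrt(d_model)    # attend over every cached key
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    context = weights @ v_cache                    # (1, d_model)
```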
Metrics can be used to assess performance and identify areas of potential bottlenecks during these two inference stages.
| Metric | Definition | Why do we care? |
| --- | --- | --- |
| Time-to-First-Token (TTFT) | Time to process the prompt and generate the first token. TTFT tells us how long prefill took. | The longer the prompt, the longer the TTFT, as the attention mechanism needs the whole input sequence to compute the KV cache. Inference optimization seeks to minimize TTFT. |
| Inter-Token Latency (ITL), AKA Time Per Output Token | Average time between consecutive tokens. ITL tells us the rate at which decoding (token generation) occurs. | Consistent ITLs are ideal, as they are indicative of efficient memory management, high GPU memory bandwidth, and well-optimized attention computation. |
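As a rough illustration, the sketch below measures TTFT and mean ITL around a streaming generation loop; `generate_stream` is a hypothetical placeholder for whatever streaming API your serving stack exposes, not a real library call.

```python
import time

def measure_ttft_and_itl(generate_stream, prompt):
    """generate_stream is assumed to be a streaming API that yields one token at a time."""
    token_times = []
    start = time.perf_counter()
    for _token in generate_stream(prompt):
        token_times.append(time.perf_counter())
    ttft = token_times[0] - start                                  # prefill + first token
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]   # per-token latencies
    itl = sum(gaps) / len(gaps) if gaps else 0.0                   # mean inter-token latency
    return ttft, itl
```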
Speculative Decoding uses a smaller, faster model to generate multiple tokens simultaneously, and then verifies them with the larger target model. Since the generated samples come from exactly the same probability distribution as those produced by naïve decoding, speculative decoding results in speed-ups in inference, while maintaining the same quality of responses.
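A minimal greedy sketch of the idea follows, assuming Hugging Face-style causal LMs (callable on a token-ID tensor and returning an object with `.logits`) and a batch size of 1; the full algorithm verifies sampled tokens with modified rejection sampling so that the output distribution matches the target model exactly.

```python
import torch

@torch.no_grad()
def speculative_step(draft_model, target_model, input_ids, k=4):
    """One greedy speculative-decoding step (sketch, batch size 1 assumed)."""
    # 1) The small draft model proposes k tokens autoregressively (cheap).
    draft_ids = input_ids
    for _ in range(k):
        logits = draft_model(draft_ids).logits[:, -1, :]
        next_id = logits.argmax(dim=-1, keepdim=True)
        draft_ids = torch.cat([draft_ids, next_id], dim=-1)

    # 2) The large target model checks all k proposals in one parallel forward pass.
    target_logits = target_model(draft_ids).logits
    n_prompt = input_ids.shape[1]
    accepted = input_ids
    for i in range(k):
        # The target's logits at position n_prompt+i-1 predict the token at position n_prompt+i.
        target_choice = target_logits[:, n_prompt + i - 1, :].argmax(dim=-1, keepdim=True)
        accepted = torch.cat([accepted, target_choice], dim=-1)   # always keep the target's token
        if not torch.equal(target_choice, draft_ids[:, n_prompt + i : n_prompt + i + 1]):
            break                                                 # draft diverged: discard the rest
    else:
        # All k drafts matched: the target's final logits give one extra "bonus" token for free.
        bonus = target_logits[:, -1, :].argmax(dim=-1, keepdim=True)
        accepted = torch.cat([accepted, bonus], dim=-1)
    return accepted
```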
SARATHI shows how chunked prefills can enable the division of large prefills into manageable chunks, which can then be batched with decode requests (decode-maximal batching) for more efficient processing.
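A toy sketch of the scheduling idea (not SARATHI's actual implementation): the batch is filled with pending decode requests first, and the leftover token budget is spent on a chunk of a waiting prefill. All names and the token budget below are illustrative.

```python
def build_hybrid_batch(pending_prefill_tokens, decode_requests, token_budget=512):
    """Decode-maximal batching, sketched: decode requests (one token each) are
    admitted first, and the remaining token budget is filled with a chunk of a
    pending prefill. Leftover prompt tokens are prefilled in later iterations."""
    batch = [("decode", req) for req in decode_requests[:token_budget]]
    remaining_budget = token_budget - len(batch)
    prefill_chunk = pending_prefill_tokens[:remaining_budget]
    leftover = pending_prefill_tokens[remaining_budget:]
    if prefill_chunk:
        batch.append(("prefill_chunk", prefill_chunk))
    return batch, leftover

batch, leftover = build_hybrid_batch(list(range(1000)), decode_requests=["req-a", "req-b"])
print(len(batch), len(leftover))   # 3 batch entries, 490 prompt tokens left to prefill
```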
Batching groups inference requests together, with larger batch sizes corresponding to higher throughput. However, batch sizes can only be increased up to a certain extent due to limited GPU on-chip memory.
To achieve maximum utilization of the hardware, one can try to find the critical ratio where two key limiting factors are in balance:

- the time spent moving data (model weights and KV cache) from GPU memory, and
- the time spent performing the computation itself.

When these two times are equal, the batch size can be increased without incurring a performance penalty. Beyond this point, increasing the batch size creates a bottleneck in either memory transfer or computation. To determine an optimal batch size, profiling is important.
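As a back-of-the-envelope example, the decode phase stops being memory-bound roughly when the batch size reaches the GPU's compute-to-bandwidth (ops:byte) ratio. The numbers below use approximate NVIDIA A100 (80 GB SXM) specifications and are meant as an illustration, not an exact model.

```python
# Back-of-the-envelope critical batch size for the decode phase.
peak_flops = 312e12        # ~312 TFLOPS of dense FP16/BF16 compute (approximate)
mem_bandwidth = 2.0e12     # ~2 TB/s of HBM bandwidth (approximate)

# During decode, each FP16 weight (2 bytes) read from memory is used in roughly
# 2 * batch_size floating-point operations (one multiply-add per sequence).
#   Memory time per weight:   2 bytes / mem_bandwidth
#   Compute time per weight:  2 * batch_size / peak_flops
# The two are equal at the "critical" batch size:
critical_batch_size = peak_flops / mem_bandwidth
print(f"critical batch size ≈ {critical_batch_size:.0f} sequences")   # ~156
```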
KV cache management plays a critical role in determining the maximum batch size and improving inference. Thus, the rest of the article will focus on managing the KV cache.
When looking at how memory is allocated in the GPU during serving, the model weights remain fixed and the activations only utilize a fraction of the GPU’s memory resources compared to the KV cache. Therefore, freeing up space for the KV cache is critical. This can be achieved by reducing the model weight memory footprint through quantization, reducing the KV cache memory footprint with modified architectures and attention variants, as well as pooling memory from multiple GPUs with parallelism.
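To see why the KV cache dominates, here is a quick sizing calculation; the configuration below roughly matches a Llama-2-7B-style architecture (32 layers, 32 KV heads, head dimension 128) and is meant as an illustration.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch_size, bytes_per_elem=2):
    # 2x for keys and values, stored for every layer, KV head, and token position.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem

# A Llama-2-7B-like configuration in FP16:
per_sequence = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128,
                              seq_len=4096, batch_size=1)
print(f"{per_sequence / 2**30:.2f} GiB of KV cache per 4096-token sequence")  # ~2 GiB
```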
Quantization reduces the number of bits needed to store the model's parameters (e.g., weights and activations, and, during training, gradients). This shrinks the memory footprint of the model, and because less data has to be moved from memory, it can also reduce inference latency, at the cost of some accuracy.
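As a toy illustration of the idea, here is a symmetric (absmax) INT8 quantizer for a single weight tensor; production quantization schemes (per-channel or per-group scales, activation calibration, formats like GPTQ or AWQ) are considerably more sophisticated.

```python
import torch

def absmax_quantize_int8(w: torch.Tensor):
    """Symmetric INT8 quantization: map the largest weight magnitude to 127."""
    scale = w.abs().max() / 127.0
    q = torch.clamp((w / scale).round(), -128, 127).to(torch.int8)
    return q, scale

def dequantize_int8(q: torch.Tensor, scale: torch.Tensor):
    return q.float() * scale            # approximate reconstruction of the original weights

w = torch.randn(4096, 4096) * 0.02      # toy weight matrix
q, scale = absmax_quantize_int8(w)
print(f"{w.numel() * w.element_size() / 2**20:.0f} MiB (fp32) -> "
      f"{q.numel() * q.element_size() / 2**20:.0f} MiB (int8)")
print("max abs error:", (w - dequantize_int8(q, scale)).abs().max().item())
```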
Review of Queries, Keys, and Values:
Queries: Represent the context or question.
Keys: Represent the information being attended to.
Values: Represent the information being retrieved.
Attention weights are computed by comparing queries with keys, and then used to weight values, producing the final output representation.
Query (Prompt) → Attention Weights → Relevant Information (Values)
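The computation above can be written in a few lines of PyTorch. This is a minimal reference version with an optional causal mask; the fused `torch.nn.functional.scaled_dot_product_attention` computes the same thing far more efficiently.

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, causal=True):
    """softmax(QK^T / sqrt(d)) V, with an optional causal mask."""
    d = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / math.sqrt(d)        # compare queries with keys
    if causal:                                             # block attention to future tokens
        t_q, t_k = scores.shape[-2], scores.shape[-1]
        mask = torch.triu(torch.ones(t_q, t_k), diagonal=1).bool()
        scores = scores.masked_fill(mask, float("-inf"))
    weights = scores.softmax(dim=-1)                       # attention weights
    return weights @ v                                     # weighted sum of values

q = k = v = torch.randn(1, 8, 16, 64)    # (batch, heads, seq_len, head_dim), toy shapes
out = scaled_dot_product_attention(q, k, v)   # (1, 8, 16, 64)
```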
| Attention Variant | Description |
| --- | --- |
| Scaled Dot-Product Attention | Scaled Dot-Product Attention (SDPA) is a key component of the Transformer architecture. It is a self-attention mechanism that allows the model to attend to different parts of the input sequence simultaneously and weigh their relevance. |
| Multi-Head Attention | In Multi-Head Attention (MHA), multiple SDPA heads operate in parallel, allowing richer relationships between different aspects of the input sequence to be captured. |
| Multi-Query Attention | Multi-Query Attention (MQA) is a memory-efficient refinement of MHA where a single key-value head is shared across all attention heads. MQA reduces the size of the KV cache, allowing space for larger batch sizes. While MQA results in faster decoder inference than MHA, there may be some quality degradation. |
| Grouped-Query Attention | Grouped-Query Attention (GQA) modifies MQA to use groups of query heads per key-value head. The number of key-value heads is more than the single head of MQA but less than the one-per-query-head of MHA. By striking a balance between MQA and MHA, GQA achieves quality comparable to MHA at a speed similar to MQA. |
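Below is a minimal PyTorch sketch of the KV-head sharing behind MQA and GQA: only `n_kv_heads` key/value heads are cached, and they are (logically) repeated to serve all query heads. Setting `n_kv_heads = 1` recovers MQA and `n_kv_heads = n_q_heads` recovers MHA; the shapes are illustrative.

```python
import torch

def grouped_query_attention(q, k, v):
    """GQA sketch. Shapes: q is (batch, n_q_heads, seq, head_dim);
    k and v are (batch, n_kv_heads, seq, head_dim) with n_kv_heads <= n_q_heads."""
    n_q_heads, n_kv_heads = q.shape[1], k.shape[1]
    group = n_q_heads // n_kv_heads
    # Only n_kv_heads K/V heads live in the KV cache; repeat them so every
    # query head has a matching key/value head to attend with.
    k = k.repeat_interleave(group, dim=1)
    v = v.repeat_interleave(group, dim=1)
    return torch.nn.functional.scaled_dot_product_attention(q, k, v, is_causal=True)

q = torch.randn(1, 32, 16, 128)      # 32 query heads
k = v = torch.randn(1, 8, 16, 128)   # only 8 KV heads are cached
out = grouped_query_attention(q, k, v)   # (1, 32, 16, 128)
```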
Sliding Window Attention (SWA), or local attention, restricts attention to a fixed-size window that slides over the sequence. While SWA alone can struggle to capture long-range context, Character AI found that speed and quality were not impacted on long sequences when interleaving SWA with global attention, with adjacent global attention layers sharing a KV cache (cross-layer attention).
Local and global attention mechanisms differ in key aspects. Local attention uses less computation (O(n * w)) and memory by focusing on token windows, enabling faster inference especially for long sequences, but may miss long-range dependencies. Global attention, while computationally more expensive (O(n^2)) and memory-intensive due to processing all token pairs, is able to better capture full context and long-range dependencies at the cost of slower inference speed.
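To make the O(n * w) structure concrete, the sketch below builds a causal sliding-window attention mask; each query position may attend only to itself and the previous `window - 1` tokens. The window size is illustrative.

```python
import torch

def sliding_window_mask(seq_len, window):
    """Boolean mask where True marks positions a token may attend to:
    causal, and at most `window` tokens back (O(n * w) work instead of O(n^2))."""
    i = torch.arange(seq_len).unsqueeze(1)   # query positions
    j = torch.arange(seq_len).unsqueeze(0)   # key positions
    return (j <= i) & (j > i - window)

print(sliding_window_mask(seq_len=6, window=3).int())
```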
Inspired by virtual memory allocation, PagedAttention proposed a framework for optimizing KV cache that takes the variation of the number of tokens across requests into consideration.
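As a rough illustration of the idea (not vLLM's actual implementation), the toy class below stores the KV cache in fixed-size blocks and keeps a per-sequence block table, so memory is allocated block by block as sequences grow instead of being reserved up front for the maximum possible length.

```python
class PagedKVCache:
    """Toy sketch of PagedAttention-style KV-cache management."""
    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))   # pool of physical blocks
        self.block_tables = {}                       # seq_id -> list of physical block ids
        self.lengths = {}                            # seq_id -> number of cached tokens

    def append_token(self, seq_id):
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length % self.block_size == 0:            # current block is full: grab a new one
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1
        block, slot = table[-1], length % self.block_size
        return block, slot                           # where this token's K/V would be written

cache = PagedKVCache(num_blocks=8, block_size=4)
for _ in range(6):
    print(cache.append_token(seq_id=0))              # only 2 of 8 blocks used for 6 tokens
```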
There are three variations of FlashAttention, with FlashAttention-3 being the latest release and optimized for Hopper GPUs. Each iteration of this algorithm takes a hardware-aware approach to make the attention computation as fast as possible. Past articles written on FlashAttention include: Designing Hardware-Aware Algorithms: FlashAttention and FlashAttention-2
Dense LLMs are the standard where all parameters are actively engaged during inference.
Mixture of Experts (MoE) LLMs are composed of multiple specialized sub-networks (experts) with a routing mechanism. Because only the relevant experts are activated for each input, MoE models often achieve better parameter efficiency and faster inference than dense models.
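Below is a minimal sketch of token-level top-k routing, the mechanism that keeps most expert parameters idle for any given token; the tiny linear "experts" and router are stand-ins for real expert feed-forward networks.

```python
import torch

def moe_forward(x, experts, router, top_k=2):
    """Each token is sent only to its top_k experts; outputs are combined
    with the (renormalized) router weights."""
    logits = router(x)                                       # (tokens, n_experts)
    weights, chosen = logits.softmax(dim=-1).topk(top_k, dim=-1)
    weights = weights / weights.sum(dim=-1, keepdim=True)    # renormalize over chosen experts
    out = torch.zeros_like(x)
    for i, expert in enumerate(experts):
        for slot in range(top_k):
            mask = chosen[:, slot] == i                      # tokens routed to expert i in this slot
            if mask.any():
                out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
    return out

d = 64
experts = torch.nn.ModuleList([torch.nn.Linear(d, d) for _ in range(8)])
router = torch.nn.Linear(d, len(experts))
y = moe_forward(torch.randn(10, d), experts, router)   # (10, 64)
```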
Larger models often require multiple GPUs to run effectively. There are a number of different parallelization strategies that allow for multi-GPU inference.
| Parallelism Type | Partitions | Description | Purpose |
| --- | --- | --- | --- |
| Data | Data | Splits different batches of data across devices. | Distributes memory and computation for large datasets that wouldn't fit on a single device. |
| Tensor | Weight tensors | Splits weight tensors across multiple devices, either row-wise or column-wise. | Distributes memory and computation for large tensors that wouldn't fit on a single device. |
| Pipeline | Model layers (vertically) | Splits the model into sequential stages (groups of layers) that run on different devices. | Improves throughput by overlapping computation of different model stages. |
| Context | Input sequences | Divides input sequences into segments across devices. | Reduces the memory bottleneck for long sequence inputs. |
| Expert | MoE experts | Splits experts, where each expert is a smaller model, across devices. | Allows for larger models with improved performance by distributing computation across multiple experts. |
| Fully Sharded Data | Data, model, optimizer, and gradients | Shards components across devices, processes data in parallel, and synchronizes after each training step. Parameters are fetched and reconstructed from shards as needed, used for computation, and then promptly discarded, reducing memory footprint. | Enables training of extremely large models that exceed the memory capacity of a single device by distributing both model parameters and activations. |
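As a small, single-process illustration of tensor (column) parallelism, the sketch below splits a linear layer's weight matrix column-wise across simulated "devices" and concatenates the partial outputs; in a real multi-GPU system the concatenation would be an all-gather collective, and the shards would live on different GPUs.

```python
import torch

def column_parallel_linear(x, weight_shards, bias_shards):
    """Each 'device' holds a column slice of the weight matrix and computes
    its slice of the output; the slices are then concatenated."""
    partial_outputs = [x @ w + b for w, b in zip(weight_shards, bias_shards)]
    return torch.cat(partial_outputs, dim=-1)

d_in, d_out, n_devices = 64, 256, 4
full_w = torch.randn(d_in, d_out)
full_b = torch.randn(d_out)
w_shards = full_w.chunk(n_devices, dim=1)     # one column block per (simulated) device
b_shards = full_b.chunk(n_devices, dim=0)

x = torch.randn(2, d_in)
# The sharded computation matches the unsharded linear layer.
assert torch.allclose(column_parallel_linear(x, w_shards, b_shards),
                      x @ full_w + full_b, atol=1e-5)
```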
It’s undeniable that inference is an exciting area of research and optimization. The field moves fast, and to keep up, inference needs to move faster. In addition to more agentic workflows, we’re seeing more dynamic inference strategies that allow models to “think longer” on harder problems. For example, reasoning models like OpenAI’s o1 show consistent performance improvements on challenging mathematical and programming tasks when more computational resources are devoted to inference.
Thanks so much for reading! This article certainly does not cover everything there is to know about inference optimization. Stay tuned for more exciting articles on this topic and adjacent ones.
Blog posts:
Mastering LLM Techniques: Inference Optimization | NVIDIA Technical Blog
LLM Inference at scale with TGI
Looking back at speculative decoding (Google Research)
LLM Inference Series: 4. KV caching, a deeper look | by Pierre Lienhart | Medium
A Visual Guide to Quantization - by Maarten Grootendorst
Optimizing AI Inference at Character.AI
Optimizing AI Inference at Character.AI (Part Deux)
Papers:
LLM-Inference-Bench: Inference Benchmarking of Large Language Models on AI Accelerators
Efficient Memory Management for Large Language Model Serving with PagedAttention
SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills
Context Parallelism for Scalable Million-Token Inference
Talks:
Mastering LLM Inference Optimization From Theory to Cost Effective Deployment: Mark Moyou
NVIDIA CEO Jensen Huang Keynote at CES 2025
Building Machine Learning Systems for a Trillion Trillion Floating Point Operations :: Jane Street
Dylan Patel - Inference Math, Simulation, and AI Megaclusters - Stanford CS 229S - Autumn 2024
How does batching work on modern GPUs?
GitHub Links:
Sharan Chetlur (NVIDIA) - Presentation Slides: High Performance LLM Serving on NVIDIA GPUs
GitHub - huggingface/search-and-learn: Recipes to scale inference-time compute of open models