Large Language Models (LLMs) generate coherent, natural language responses, effectively automating a multitude of tasks that were previously exclusive to humans. As many key players in the field, such as Jensen Huang and Ilya Sutskever, have recently alluded to, we’re in an era of agentic AI. This new paradigm seeks to revolutionize various aspects of our lives, from personalized medicine and education to intelligent assistants, and beyond.
However, while these models grow increasingly powerful, widespread adoption is hindered by the massive cost of running them, wait times that render certain real-world applications impractical, and, of course, their growing carbon footprint. To reap the benefits of this technology while mitigating cost and power consumption, it is critical that we continue to optimize every aspect of LLM inference.
The goal of this article is to give readers an overview of current ways in which researchers and deep learning practitioners are optimizing LLM inference.
Just as a person applies what they have learned to solve a new problem, inference is when a trained AI model uses the patterns detected during training to make predictions on new data. This inference process is what enables LLMs to perform tasks like text completion, translation, summarization, and conversation.
DigitalOcean has partnered with HuggingFace to offer 1-click models. This allows for the integration of GPU Droplets with state-of-the-art open-source LLMs in Text Generation Inference (TGI)-optimized container applications. This means many of the inference optimizations covered in this article (e.g., tensor parallelism, quantization, FlashAttention, PagedAttention) are already taken care of and maintained by HuggingFace. For information on how to use these 1-click models, check out our article Getting Started with LLMs.
While this article includes some introductory deep learning concepts, many topics discussed are relatively advanced. Those determined to better understand inference optimization are encouraged to explore the links scattered throughout the article and in the references section.
It is advised that readers have an understanding of neural network fundamentals, the attention mechanism, the transformer, and data types before proceeding.
It would also help to be knowledgeable about the GPU memory hierarchy.
The article Introduction to GPU Performance Optimization provides context on how GPUs can be programmed to accelerate neural network training and inference. It also explains key terms such as latency and throughput.
LLM inference can be divided into two phases: prefill and decode. These phases are treated separately because of their different computational requirements: prefill, a highly parallelized matrix-matrix operation that saturates GPU utilization, is compute-bound, while decode, a matrix-vector operation that underutilizes the GPU's compute capability, is memory-bound.
The prefill phase can be likened to reading an entire document at once and processing all of its words simultaneously to write the first word of a response, whereas the decode phase can be compared to continuing to write that response word by word, where the choice of each word depends on what was written before.
Let’s explore why prefill is compute-bound and decode is memory-bound.
In the prefill stage, the LLM processes the entire input prompt at once to generate the first response token. This involves performing a full forward pass through the transformer layers for every token in the prompt simultaneously. While memory access is needed during prefill, the computational work of processing the tokens in parallel dominates the performance profile.
In the decode stage, text is generated autoregressively: the next token is predicted one at a time given all previous tokens. The decoding process is memory-bound due to its need to repeatedly access historical context. For each new token generated, the model must load the attention cache (key/value states, AKA the KV cache) from all previous tokens, requiring frequent memory accesses that become more intensive as the sequence grows longer.
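To make the two phases concrete, here is a minimal NumPy sketch of a single attention layer's cache behavior (toy dimensions, random weights, no output projection, all names illustrative): prefill projects the whole prompt with one matrix-matrix multiply, while each decode step projects a single token and must re-read the growing KV cache.

```python
import numpy as np

d_model, prompt_len = 64, 16
W_qkv = np.random.randn(d_model, 3 * d_model)     # stand-in for one layer's QKV projection

# Prefill: all prompt tokens are projected at once (matrix-matrix multiply, compute-bound).
prompt_hidden = np.random.randn(prompt_len, d_model)
qkv = prompt_hidden @ W_qkv                        # (prompt_len, 3 * d_model)
k_cache, v_cache = qkv[:, d_model:2 * d_model], qkv[:, 2 * d_model:]

# Decode: each step projects a single token (matrix-vector multiply, memory-bound)
# and must re-read the ever-growing KV cache.
for step in range(4):
    new_hidden = np.random.randn(1, d_model)       # hidden state of the last generated token
    q, k, v = np.split(new_hidden @ W_qkv, 3, axis=-1)
    k_cache = np.vstack([k_cache, k])              # the cache grows by one entry per step
    v_cache = np.vstack([v_cache, v])
    scores = (q @ k_cache.T) / np.sqrt(d_model)    # attend over every cached key
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    context = weights @ v_cache                    # (1, d_model)
```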
Metrics can be used to assess performance and identify areas of potential bottlenecks during these two inference stages.
| Metric | Definition | Why do we care? |
| --- | --- | --- |
| Time-to-First-Token (TTFT) | Time to process the prompt and generate the first token. TTFT tells us how long prefill took. | The longer the prompt, the longer the TTFT, as the attention mechanism needs the whole input sequence to compute the KV cache. Inference optimization seeks to minimize TTFT. |
| Inter-Token Latency (ITL), AKA Time Per Output Token | Average time between consecutive tokens. ITL tells us the rate at which decoding (token generation) occurs. | Consistent ITLs are ideal, as they are indicative of efficient memory management, high GPU memory bandwidth, and well-optimized attention computation. |
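As a rough illustration, the sketch below measures TTFT and mean ITL around a streaming generation loop; `generate_stream` is a hypothetical placeholder for whatever streaming API your serving stack exposes, not a real library call.

```python
import time

def measure_ttft_and_itl(generate_stream, prompt):
    """generate_stream is assumed to be a streaming API that yields one token at a time."""
    token_times = []
    start = time.perf_counter()
    for _token in generate_stream(prompt):
        token_times.append(time.perf_counter())
    ttft = token_times[0] - start                                  # prefill + first token
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]   # per-token latencies
    itl = sum(gaps) / len(gaps) if gaps else 0.0                   # mean inter-token latency
    return ttft, itl
```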
Speculative Decoding uses a smaller, faster model to generate multiple tokens simultaneously, and then verifies them with the larger target model. Since the generated samples come from exactly the same probability distribution as those produced by naïve decoding, speculative decoding results in speed-ups in inference, while maintaining the same quality of responses.
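A minimal greedy sketch of the idea follows, assuming Hugging Face-style causal LMs (callable on a token-ID tensor and returning an object with `.logits`) and a batch size of 1; the full algorithm verifies sampled tokens with modified rejection sampling so that the output distribution matches the target model exactly.

```python
import torch

@torch.no_grad()
def speculative_step(draft_model, target_model, input_ids, k=4):
    """One greedy speculative-decoding step (sketch, batch size 1 assumed)."""
    # 1) The small draft model proposes k tokens autoregressively (cheap).
    draft_ids = input_ids
    for _ in range(k):
        logits = draft_model(draft_ids).logits[:, -1, :]
        next_id = logits.argmax(dim=-1, keepdim=True)
        draft_ids = torch.cat([draft_ids, next_id], dim=-1)

    # 2) The large target model checks all k proposals in one parallel forward pass.
    target_logits = target_model(draft_ids).logits
    n_prompt = input_ids.shape[1]
    accepted = input_ids
    for i in range(k):
        # The target's logits at position n_prompt+i-1 predict the token at position n_prompt+i.
        target_choice = target_logits[:, n_prompt + i - 1, :].argmax(dim=-1, keepdim=True)
        accepted = torch.cat([accepted, target_choice], dim=-1)   # always keep the target's token
        if not torch.equal(target_choice, draft_ids[:, n_prompt + i : n_prompt + i + 1]):
            break                                                 # draft diverged: discard the rest
    else:
        # All k drafts matched: the target's final logits give one extra "bonus" token for free.
        bonus = target_logits[:, -1, :].argmax(dim=-1, keepdim=True)
        accepted = torch.cat([accepted, bonus], dim=-1)
    return accepted
```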
SARATHI shows how chunked prefills can enable the division of large prefills into manageable chunks, which can then be batched with decode requests (decode-maximal batching) for more efficient processing.
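A toy sketch of the scheduling idea (not SARATHI's actual implementation): the batch is filled with pending decode requests first, and the leftover token budget is spent on a chunk of a waiting prefill. All names and the token budget below are illustrative.

```python
def build_hybrid_batch(pending_prefill_tokens, decode_requests, token_budget=512):
    """Decode-maximal batching, sketched: decode requests (one token each) are
    admitted first, and the remaining token budget is filled with a chunk of a
    pending prefill. Leftover prompt tokens are prefilled in later iterations."""
    batch = [("decode", req) for req in decode_requests[:token_budget]]
    remaining_budget = token_budget - len(batch)
    prefill_chunk = pending_prefill_tokens[:remaining_budget]
    leftover = pending_prefill_tokens[remaining_budget:]
    if prefill_chunk:
        batch.append(("prefill_chunk", prefill_chunk))
    return batch, leftover

batch, leftover = build_hybrid_batch(list(range(1000)), decode_requests=["req-a", "req-b"])
print(len(batch), len(leftover))   # 3 batch entries, 490 prompt tokens left to prefill
```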
Batching groups inference requests together, with larger batch sizes corresponding to higher throughput. However, batch sizes can only be increased up to a certain extent due to limited GPU on-chip memory.
To achieve maximum utilization of the hardware, one can try to find the critical ratio where two key limiting factors are in balance:

- the time spent moving data (model weights and KV cache) from GPU memory, and
- the time spent performing the computation itself.

When these two times are equal, the batch size can be increased without incurring a performance penalty. Beyond this point, increasing the batch size creates a bottleneck in either memory transfer or computation. To determine an optimal batch size, profiling is important.
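As a back-of-the-envelope example, the decode phase stops being memory-bound roughly when the batch size reaches the GPU's compute-to-bandwidth (ops:byte) ratio. The numbers below use approximate NVIDIA A100 (80 GB SXM) specifications and are meant as an illustration, not an exact model.

```python
# Back-of-the-envelope critical batch size for the decode phase.
peak_flops = 312e12        # ~312 TFLOPS of dense FP16/BF16 compute (approximate)
mem_bandwidth = 2.0e12     # ~2 TB/s of HBM bandwidth (approximate)

# During decode, each FP16 weight (2 bytes) read from memory is used in roughly
# 2 * batch_size floating-point operations (one multiply-add per sequence).
#   Memory time per weight:   2 bytes / mem_bandwidth
#   Compute time per weight:  2 * batch_size / peak_flops
# The two are equal at the "critical" batch size:
critical_batch_size = peak_flops / mem_bandwidth
print(f"critical batch size ≈ {critical_batch_size:.0f} sequences")   # ~156
```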
KV cache management plays a critical role in determining the maximum batch size and improving inference. Thus, the rest of the article will focus on managing the KV cache.
When looking at how memory is allocated in the GPU during serving, the model weights remain fixed and the activations only utilize a fraction of the GPU’s memory resources compared to the KV cache. Therefore, freeing up space for the KV cache is critical. This can be achieved by reducing the model weight memory footprint through quantization, reducing the KV cache memory footprint with modified architectures and attention variants, as well as pooling memory from multiple GPUs with parallelism.
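To see why the KV cache dominates, here is a quick sizing calculation; the configuration below roughly matches a Llama-2-7B-style architecture (32 layers, 32 KV heads, head dimension 128) and is meant as an illustration.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch_size, bytes_per_elem=2):
    # 2x for keys and values, stored for every layer, KV head, and token position.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem

# A Llama-2-7B-like configuration in FP16:
per_sequence = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128,
                              seq_len=4096, batch_size=1)
print(f"{per_sequence / 2**30:.2f} GiB of KV cache per 4096-token sequence")  # ~2 GiB
```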
Quantization reduces the number of bits needed to store the model's parameters (e.g., weights and activations, and, during training, gradients). This shrinks the memory footprint of the model, and because less data has to be moved from memory, it can also reduce inference latency, at the cost of some accuracy.
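As a toy illustration of the idea, here is a symmetric (absmax) INT8 quantizer for a single weight tensor; production quantization schemes (per-channel or per-group scales, activation calibration, formats like GPTQ or AWQ) are considerably more sophisticated.

```python
import torch

def absmax_quantize_int8(w: torch.Tensor):
    """Symmetric INT8 quantization: map the largest weight magnitude to 127."""
    scale = w.abs().max() / 127.0
    q = torch.clamp((w / scale).round(), -128, 127).to(torch.int8)
    return q, scale

def dequantize_int8(q: torch.Tensor, scale: torch.Tensor):
    return q.float() * scale            # approximate reconstruction of the original weights

w = torch.randn(4096, 4096) * 0.02      # toy weight matrix
q, scale = absmax_quantize_int8(w)
print(f"{w.numel() * w.element_size() / 2**20:.0f} MiB (fp32) -> "
      f"{q.numel() * q.element_size() / 2**20:.0f} MiB (int8)")
print("max abs error:", (w - dequantize_int8(q, scale)).abs().max().item())
```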
Review of Queries, Keys, and Values:
Queries: Represent the context or question.
Keys: Represent the information being attended to.
Values: Represent the information being retrieved.
Attention weights are computed by comparing queries with keys, and then used to weight values, producing the final output representation.
Query (Prompt) → Attention Weights → Relevant Information (Values)
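The computation above can be written in a few lines of PyTorch. This is a minimal reference version with an optional causal mask; the fused `torch.nn.functional.scaled_dot_product_attention` computes the same thing far more efficiently.

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, causal=True):
    """softmax(QK^T / sqrt(d)) V, with an optional causal mask."""
    d = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / math.sqrt(d)        # compare queries with keys
    if causal:                                             # block attention to future tokens
        t_q, t_k = scores.shape[-2], scores.shape[-1]
        mask = torch.triu(torch.ones(t_q, t_k), diagonal=1).bool()
        scores = scores.masked_fill(mask, float("-inf"))
    weights = scores.softmax(dim=-1)                       # attention weights
    return weights @ v                                     # weighted sum of values

q = k = v = torch.randn(1, 8, 16, 64)    # (batch, heads, seq_len, head_dim), toy shapes
out = scaled_dot_product_attention(q, k, v)   # (1, 8, 16, 64)
```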
| Attention Variant | Description |
| --- | --- |
| Scaled Dot-Product Attention | Scaled Dot-Product Attention (SDPA) is a key component of the Transformer architecture. It is a self-attention mechanism that allows the model to attend to different parts of the input sequence simultaneously and weigh their relevance. |
| Multi-Head Attention | In Multi-Head Attention (MHA), multiple SDPA heads operate in parallel, allowing richer relationships between different aspects of the input sequence to be captured. |
| Multi-Query Attention | Multi-Query Attention (MQA) is a memory-efficient refinement of MHA where a single key-value head is shared across all attention heads. MQA reduces the size of the KV cache, allowing space for larger batch sizes. While MQA results in faster decoder inference than MHA, there may be some quality degradation. |
| Grouped-Query Attention | Grouped-Query Attention (GQA) modifies MQA to use groups of query heads per key-value head. The number of key-value heads is more than the single head of MQA but less than the one-per-query-head of MHA. By striking a balance between MQA and MHA, GQA achieves quality comparable to MHA at a speed similar to MQA. |
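Below is a minimal PyTorch sketch of the KV-head sharing behind MQA and GQA: only `n_kv_heads` key/value heads are cached, and they are (logically) repeated to serve all query heads. Setting `n_kv_heads = 1` recovers MQA and `n_kv_heads = n_q_heads` recovers MHA; the shapes are illustrative.

```python
import torch

def grouped_query_attention(q, k, v):
    """GQA sketch. Shapes: q is (batch, n_q_heads, seq, head_dim);
    k and v are (batch, n_kv_heads, seq, head_dim) with n_kv_heads <= n_q_heads."""
    n_q_heads, n_kv_heads = q.shape[1], k.shape[1]
    group = n_q_heads // n_kv_heads
    # Only n_kv_heads K/V heads live in the KV cache; repeat them so every
    # query head has a matching key/value head to attend with.
    k = k.repeat_interleave(group, dim=1)
    v = v.repeat_interleave(group, dim=1)
    return torch.nn.functional.scaled_dot_product_attention(q, k, v, is_causal=True)

q = torch.randn(1, 32, 16, 128)      # 32 query heads
k = v = torch.randn(1, 8, 16, 128)   # only 8 KV heads are cached
out = grouped_query_attention(q, k, v)   # (1, 32, 16, 128)
```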
Sliding Window Attention (SWA), or local attention, restricts attention to a fixed-size window that slides over the sequence. While SWA alone can struggle to capture long-range context, Character AI found that speed and quality were not impacted on long sequences when interleaving SWA with global attention, with adjacent global attention layers sharing a KV cache (cross-layer attention).
Local and global attention mechanisms differ in key aspects. Local attention uses less computation (O(n * w)) and memory by focusing on token windows, enabling faster inference especially for long sequences, but may miss long-range dependencies. Global attention, while computationally more expensive (O(n^2)) and memory-intensive due to processing all token pairs, is able to better capture full context and long-range dependencies at the cost of slower inference speed.
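To make the O(n * w) structure concrete, the sketch below builds a causal sliding-window attention mask; each query position may attend only to itself and the previous `window - 1` tokens. The window size is illustrative.

```python
import torch

def sliding_window_mask(seq_len, window):
    """Boolean mask where True marks positions a token may attend to:
    causal, and at most `window` tokens back (O(n * w) work instead of O(n^2))."""
    i = torch.arange(seq_len).unsqueeze(1)   # query positions
    j = torch.arange(seq_len).unsqueeze(0)   # key positions
    return (j <= i) & (j > i - window)

print(sliding_window_mask(seq_len=6, window=3).int())
```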
Inspired by virtual memory allocation, PagedAttention proposed a framework for optimizing KV cache that takes the variation of the number of tokens across requests into consideration.
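As a rough illustration of the idea (not vLLM's actual implementation), the toy class below stores the KV cache in fixed-size blocks and keeps a per-sequence block table, so memory is allocated block by block as sequences grow instead of being reserved up front for the maximum possible length.

```python
class PagedKVCache:
    """Toy sketch of PagedAttention-style KV-cache management."""
    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))   # pool of physical blocks
        self.block_tables = {}                       # seq_id -> list of physical block ids
        self.lengths = {}                            # seq_id -> number of cached tokens

    def append_token(self, seq_id):
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length % self.block_size == 0:            # current block is full: grab a new one
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1
        block, slot = table[-1], length % self.block_size
        return block, slot                           # where this token's K/V would be written

cache = PagedKVCache(num_blocks=8, block_size=4)
for _ in range(6):
    print(cache.append_token(seq_id=0))              # only 2 of 8 blocks used for 6 tokens
```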
There are three variations of FlashAttention, with FlashAttention-3 being the latest release and optimized for Hopper GPUs. Each iteration of this algorithm takes a hardware-aware approach to make the attention computation as fast as possible. Past articles written on FlashAttention include: Designing Hardware-Aware Algorithms: FlashAttention and FlashAttention-2
Dense LLMs are the standard where all parameters are actively engaged during inference.
Mixture of Experts (MoE) LLMs are composed of multiple specialized sub-networks (experts) with a routing mechanism. Because only the relevant experts are activated for each input, MoE models often achieve better parameter efficiency and faster inference than dense models.
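Below is a minimal sketch of token-level top-k routing, the mechanism that keeps most expert parameters idle for any given token; the tiny linear "experts" and router are stand-ins for real expert feed-forward networks.

```python
import torch

def moe_forward(x, experts, router, top_k=2):
    """Each token is sent only to its top_k experts; outputs are combined
    with the (renormalized) router weights."""
    logits = router(x)                                       # (tokens, n_experts)
    weights, chosen = logits.softmax(dim=-1).topk(top_k, dim=-1)
    weights = weights / weights.sum(dim=-1, keepdim=True)    # renormalize over chosen experts
    out = torch.zeros_like(x)
    for i, expert in enumerate(experts):
        for slot in range(top_k):
            mask = chosen[:, slot] == i                      # tokens routed to expert i in this slot
            if mask.any():
                out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
    return out

d = 64
experts = torch.nn.ModuleList([torch.nn.Linear(d, d) for _ in range(8)])
router = torch.nn.Linear(d, len(experts))
y = moe_forward(torch.randn(10, d), experts, router)   # (10, 64)
```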
Larger models often require multiple GPUs to run effectively. There are a number of different parallelization strategies that allow for multi-GPU inference.
| Parallelism Type | Partitions | Description | Purpose |
| --- | --- | --- | --- |
| Data | Data | Splits different batches of data across devices. | Distributes memory and computation for large datasets that wouldn't fit on a single device. |
| Tensor | Weight tensors | Splits weight tensors across multiple devices, either row-wise or column-wise. | Distributes memory and computation for large tensors that wouldn't fit on a single device. |
| Pipeline | Model layers (vertically) | Splits the model into sequential stages (groups of layers) that run on different devices. | Improves throughput by overlapping computation of different model stages. |
| Context | Input sequences | Divides input sequences into segments across devices. | Reduces the memory bottleneck for long sequence inputs. |
| Expert | MoE experts | Splits experts, where each expert is a smaller model, across devices. | Allows for larger models with improved performance by distributing computation across multiple experts. |
| Fully Sharded Data | Data, model, optimizer, and gradients | Shards components across devices, processes data in parallel, and synchronizes after each training step. Parameters are fetched and reconstructed from shards as needed, used for computation, and then promptly discarded, reducing memory footprint. | Enables training of extremely large models that exceed the memory capacity of a single device by distributing both model parameters and activations. |
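As a small, single-process illustration of tensor (column) parallelism, the sketch below splits a linear layer's weight matrix column-wise across simulated "devices" and concatenates the partial outputs; in a real multi-GPU system the concatenation would be an all-gather collective, and the shards would live on different GPUs.

```python
import torch

def column_parallel_linear(x, weight_shards, bias_shards):
    """Each 'device' holds a column slice of the weight matrix and computes
    its slice of the output; the slices are then concatenated."""
    partial_outputs = [x @ w + b for w, b in zip(weight_shards, bias_shards)]
    return torch.cat(partial_outputs, dim=-1)

d_in, d_out, n_devices = 64, 256, 4
full_w = torch.randn(d_in, d_out)
full_b = torch.randn(d_out)
w_shards = full_w.chunk(n_devices, dim=1)     # one column block per (simulated) device
b_shards = full_b.chunk(n_devices, dim=0)

x = torch.randn(2, d_in)
# The sharded computation matches the unsharded linear layer.
assert torch.allclose(column_parallel_linear(x, w_shards, b_shards),
                      x @ full_w + full_b, atol=1e-5)
```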
It’s undeniable that inference is an exciting area of research and optimization. The field moves fast, and to keep up, inference needs to move faster. In addition to more agentic workflows, we’re seeing more dynamic inference strategies that allow models to “think longer” on harder problems. For example, reasoning models like OpenAI’s o1 show consistent performance improvements on challenging mathematical and programming tasks when more computational resources are devoted to inference.
Thanks so much for reading! This article certainly does not cover everything there is to know about inference optimization. Stay tuned for more exciting articles on this topic and adjacent ones.
Blog posts:
Mastering LLM Techniques: Inference Optimization | NVIDIA Technical Blog
LLM Inference at scale with TGI
Looking back at speculative decoding (Google Research)
LLM Inference Series: 4. KV caching, a deeper look | by Pierre Lienhart | Medium
A Visual Guide to Quantization - by Maarten Grootendorst
Optimizing AI Inference at Character.AI
Optimizing AI Inference at Character.AI (Part Deux)
Papers:
LLM-Inference-Bench: Inference Benchmarking of Large Language Models on AI Accelerators
Efficient Memory Management for Large Language Model Serving with PagedAttention
SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills
Context Parallelism for Scalable Million-Token Inference
Talks:
Mastering LLM Inference Optimization From Theory to Cost Effective Deployment: Mark Moyou
NVIDIA CEO Jensen Huang Keynote at CES 2025
Building Machine Learning Systems for a Trillion Trillion Floating Point Operations :: Jane Street
Dylan Patel - Inference Math, Simulation, and AI Megaclusters - Stanford CS 229S - Autumn 2024
How does batching work on modern GPUs?
GitHub Links:
Sharan Chetlur (NVIDIA) - Presentation Slides: High Performance LLM Serving on NVIDIA GPUs
GitHub - huggingface/search-and-learn: Recipes to scale inference-time compute of open models