One of the questions new data scientists and ML engineers ask most often is whether their deep learning training processes are running optimally. In this guide, we will learn how to diagnose and fix deep learning performance issues, whether we are working on one machine or many. This will help us make practical and effective use of the wide variety of available cloud GPUs.
We will start by understanding what GPU utilization is, and we’ll finish by discussing the optimal batch size for maximum GPU utilization.
Note: This guide assumes a basic understanding of the Linux operating system and the Python programming language. Most recent Linux distributions, including Ubuntu, come with Python pre-installed, so we can go ahead and install pip and conda, as we will use them here.
In order to follow along with this article, you will need experience with Python code and a beginner’s understanding of deep learning. We will operate under the assumption that all readers have access to machines powerful enough to run the code provided.
If you do not have access to a GPU, we suggest accessing one through the cloud. Many cloud providers offer GPUs. DigitalOcean GPU Droplets are currently in Early Availability.
For instructions on getting started with Python code, we recommend trying this beginner’s guide to set up your system and prepare to run beginner tutorials.
In machine learning and deep learning training sessions, GPU utilization is one of the most important metrics to observe, and it is exposed by both built-in and third-party GPU tools.
We can define GPU utilization as the percentage of time, over the last sample period (roughly the past second), during which one or more GPU kernels were executing. It tells us how busy a deep learning program is keeping the GPU.
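One way to watch this number while a training job runs is to poll nvidia-smi. The sketch below assumes an NVIDIA driver with nvidia-smi available on the PATH. The helper names parse_gpu_stats and query_gpu_stats are made up for illustration, and the parser assumes every field is numeric (some GPUs report [N/A] for certain fields, which this sketch does not handle).

```python
import shutil
import subprocess

# Fields requested from nvidia-smi: busy percentage and memory in MiB.
QUERY_FIELDS = "utilization.gpu,memory.used,memory.total"


def parse_gpu_stats(csv_text):
    """Parse `nvidia-smi --format=csv,noheader,nounits` output into dicts."""
    stats = []
    for line in csv_text.strip().splitlines():
        util, mem_used, mem_total = (int(field.strip()) for field in line.split(","))
        stats.append(
            {"util_pct": util, "mem_used_mib": mem_used, "mem_total_mib": mem_total}
        )
    return stats


def query_gpu_stats():
    """Return one dict per visible GPU, or an empty list if nvidia-smi is missing."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_gpu_stats(result.stdout)
```

Calling query_gpu_stats() every second or so from a separate terminal or thread gives a rough picture of how busy each GPU is. A utilization value that stays low during training usually suggests the GPU is waiting on something else, such as data loading.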
Let us look at a real scenario:
On a typical day, a data scientist gets two GPUs to work with, which “should” be sufficient resources. On most days, during the build phase, the work consists of short interactions with the GPUs and the workflow is smooth. Then the training phase kicks in, and suddenly the workflow demands additional GPU compute that is not readily available.
This means that more compute resources will be required to do any sort of significant work. We place particular emphasis on the following tasks as being impossible when all RAM is allocated:
In general, these upgrades translate into a twofold increase in hardware utilization and a 100% increase in model training speed.
Choosing a batch size is often confusing because there is no single “best” batch size for a given data set and model architecture. A larger batch size will usually train faster and consume more memory, but it might yield lower accuracy in the end. First, let us understand what a batch size is and why we need it.
It is important to specify a batch size when training a model such as a deep neural network. Put simply, the batch size is the number of samples that are passed through the network at one time.
Let’s say we want to train our network to recognize different cat breeds using 1000 photos of cats, and that we have chosen a batch size of 10. This means that the network receives 10 photographs of cats at a time, as a group, or batch.
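The arithmetic behind this example can be written out as a small sketch. The function names are illustrative only. The second pair of calls uses a batch size of 32, which is not from the example above, to show what happens when the dataset does not divide evenly.

```python
import math


def batches_per_epoch(num_samples, batch_size):
    """Number of batches needed to see every sample once."""
    return math.ceil(num_samples / batch_size)


def last_batch_size(num_samples, batch_size):
    """Size of the final batch, which is smaller when the division is not exact."""
    remainder = num_samples % batch_size
    return remainder if remainder else batch_size


print(batches_per_epoch(1000, 10))  # 100 batches of 10 photos each
print(last_batch_size(1000, 10))    # 10, the dataset divides evenly
print(batches_per_epoch(1000, 32))  # 32 batches
print(last_batch_size(1000, 32))    # 8, the final batch is only partly filled
```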
Cool, we have the idea of batch size now, but what’s the point? We could just pass each data element individually to our model rather than putting the data in batches. The section below explains why we need them.
We mentioned earlier that a larger batch size helps a model complete each epoch more quickly during training. This is because, depending on the computational resources available, a machine may be able to process many more than one sample at a time.
However, even if our machine can handle very large batches, the quality of the final model may degrade as we increase the batch size, which may ultimately limit the model’s ability to generalize to new data.
We can now agree that batch size is another hyperparameter we need to assess and tune depending on how a particular model performs during training. We also need to examine how well our machine utilizes the GPU at different batch sizes.
For instance, if we set our batch size to a rather high value, say 100, our machine might not have enough memory or processing capacity to process all 100 images simultaneously. This would indicate that we need to reduce our batch size.
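A common way to act on this is to start high and halve the batch size until one training step succeeds. The following is a minimal sketch of that idea, not a definitive implementation. The callable try_batch is hypothetical and stands for "run one training step at this batch size". The sketch assumes an out-of-memory failure surfaces either as MemoryError or as a RuntimeError whose message mentions "out of memory", which is how PyTorch reports CUDA memory exhaustion.

```python
def find_largest_batch_size(try_batch, start=256, minimum=1):
    """Halve the batch size until `try_batch(batch_size)` stops running out of memory.

    `try_batch` is a hypothetical callable that runs one training step and
    raises MemoryError, or a RuntimeError mentioning "out of memory", on failure.
    """
    batch_size = start
    while batch_size >= minimum:
        try:
            try_batch(batch_size)
            return batch_size
        except MemoryError:
            pass  # too large: fall through and halve
        except RuntimeError as err:
            if "out of memory" not in str(err).lower():
                raise  # unrelated error: do not hide it
        batch_size //= 2
    raise RuntimeError("even the minimum batch size does not fit in memory")


# Stand-in for a real training step: pretend anything above 100 samples overflows.
def fake_step(batch_size):
    if batch_size > 100:
        raise RuntimeError("CUDA out of memory (simulated)")


print(find_largest_batch_size(fake_step, start=256))  # 64
```

The batch size found this way is only an upper bound set by memory. As discussed above, the value that generalizes best may well be smaller.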
Now that we have a general idea of what a batch size is, let’s see how we can find the right batch size in code using PyTorch and Keras.
In this section we will run through finding the right batch size on a Resnet18 model. We will use the PyTorch profiler to measure the training performance and GPU utilization of the Resnet18 model.
To show more of how TensorBoard can be used to monitor PyTorch model performance, we will use the PyTorch profiler in this code with extra options turned on.
On your cloud GPU-powered machine, use wget to download the corresponding notebook. Then, run JupyterLab to open the notebook. You can do this by running the following commands and opening the notebook link:
wget https://raw.githubusercontent.com/gradient-ai/batch-optimization-DL/refs/heads/main/notebook.ipynb
jupyter lab
Type the following command to install torch, torchvision, and the PyTorch profiler plugin for TensorBoard.
pip3 install torch torchvision torch-tb-profiler
The following code will grab our dataset from CIFAR10. Next, we will use transfer learning with the pre-trained model resnet18 and train the model.
# import all the necessary libraries
import torch
import torch.nn
import torch.optim
import torch.profiler
import torch.utils.data
import torchvision.datasets
import torchvision.models
import torchvision.transforms as T

# prepare input data and transform it
transform = T.Compose(
    [T.Resize(224),
     T.ToTensor(),
     T.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])
train_set = torchvision.datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)

# use dataloader to launch each batch
train_loader = torch.utils.data.DataLoader(train_set, batch_size=1, shuffle=True, num_workers=4)

# Create a Resnet model, loss function, and optimizer objects. To run on GPU, move model and loss to a GPU device
device = torch.device("cuda:0")
model = torchvision.models.resnet18(pretrained=True).cuda(device)
criterion = torch.nn.CrossEntropyLoss().cuda(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
model.train()

# define the training step for each batch of input data
def train(data):
    inputs, labels = data[0].to(device=device), data[1].to(device=device)
    outputs = model(inputs)
    loss = criterion(outputs, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
We have successfully set up our basic model. Now we are going to enable optional profiler features to record more information during the training process. Let’s include the following parameters:
schedule - a callable that takes the current step (int) as its single argument and returns the profiler action to perform at that step.
profile_memory - tracks tensor memory allocation and deallocation. Setting it to true may cost you additional time.
with_stack - records source information (file and line number) for the traced operations.
Now that we understand these terms, we can return to the code:
with torch.profiler.profile(
        schedule=torch.profiler.schedule(wait=1, warmup=1, active=3, repeat=2),
        on_trace_ready=torch.profiler.tensorboard_trace_handler('./log/resnet18_batchsize1'),
        record_shapes=True,
        profile_memory=True,
        with_stack=True
) as prof:
    for step, batch_data in enumerate(train_loader):
        if step >= (1 + 1 + 3) * 2:
            break
        train(batch_data)
        prof.step()  # Need to call this at the end of each step to notify the profiler of the step boundary.
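To see why the loop stops after (1 + 1 + 3) * 2 = 10 steps, it can help to list what the profiler does at each step. The function below is a simplified pure-Python model of the schedule, written for illustration under the assumption that skip_first is 0. It is not the library's own code, and the phase names are plain strings rather than the real ProfilerAction values.

```python
def schedule_phases(total_steps, wait=1, warmup=1, active=3, repeat=2):
    """Simplified model of torch.profiler.schedule (skip_first=0 assumed)."""
    cycle = wait + warmup + active
    phases = []
    for step in range(total_steps):
        if repeat and step >= cycle * repeat:
            phases.append("none")             # all cycles finished
            continue
        position = step % cycle
        if position < wait:
            phases.append("none")             # profiler idle
        elif position < wait + warmup:
            phases.append("warmup")           # tracing on, results discarded
        elif position < cycle - 1:
            phases.append("record")
        else:
            phases.append("record_and_save")  # last active step of the cycle
    return phases


for step, phase in enumerate(schedule_phases(10)):
    print(step, phase)
```

With wait=1, warmup=1, active=3 and repeat=2, each cycle takes five steps and two cycles are recorded, so nothing is profiled from step 10 onwards. Assuming the torch-tb-profiler plugin installed earlier, pointing TensorBoard at the log directory (for example with tensorboard --logdir=./log) should then show the recorded traces.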
For the Keras example, we are going to use an arbitrary Sequential model:
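# Imports the snippet relies on (tf.keras assumed; the original omitted them)
from tensorflow.keras import regularizers
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Sequential

model = Sequential([
    Dense(units=16, input_shape=(1,), activation='relu'),
    Dense(units=32, activation='relu', kernel_regularizer=regularizers.l2(0.01)),
    Dense(units=2, activation='sigmoid')
])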
Let’s concentrate on where we call model.fit(). This is the function we call to train our model, and it is where the artificial neural network learns.
model.fit(
    x=scaled_train_samples,
    y=train_labels,
    validation_data=valid_set,
    batch_size=10,
    epochs=20,
    shuffle=True,
    verbose=2
)
The fit() function above accepts a parameter called batch_size. This is where we assign a value for the batch size. In this model, we have set the value to 10. Therefore, during training, we will pass in 10 samples at a time until the epoch is complete. Thereafter, we begin the process over again for the next epoch.
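To compare several values of batch_size in practice, one option is to time a fixed number of training steps for each candidate and compare samples per second. The harness below is a sketch under stated assumptions: run_steps is a hypothetical callable that would wrap the real training call for a given batch size and step count, and fake_run_steps is a stand-in workload so the example runs without a model or GPU.

```python
import time


def samples_per_second(run_steps, batch_size, num_steps=20):
    """Time `num_steps` training steps and return the throughput in samples/s.

    `run_steps(batch_size, num_steps)` is a hypothetical callable that wraps
    the real training call for a fixed number of steps.
    """
    start = time.perf_counter()
    run_steps(batch_size, num_steps)
    elapsed = time.perf_counter() - start
    return batch_size * num_steps / elapsed


def sweep_batch_sizes(run_steps, candidates=(10, 32, 64, 128)):
    """Return {batch_size: samples_per_second} for each candidate."""
    return {size: samples_per_second(run_steps, size) for size in candidates}


# Stand-in workload: a fixed per-step overhead plus a small per-sample cost.
def fake_run_steps(batch_size, num_steps):
    for _ in range(num_steps):
        time.sleep(0.001 + 0.00001 * batch_size)


for size, rate in sweep_batch_sizes(fake_run_steps).items():
    print(f"batch_size={size}: {rate:,.0f} samples/s")
```

Throughput alone does not pick the batch size, since the final accuracy still has to be checked, but it shows where larger batches stop paying off on a given machine.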
When performing multi-GPU training, pay close attention to the batch size, as it can affect speed and memory usage and the convergence of your model, and, if we’re not careful, our model weights could be corrupted!
Speed and memory - Training and prediction generally run faster with larger batches. Small batches incur more overhead from loading and unloading data on the GPUs, although some studies indicate that training with a small batch size yields better final scores for such models. On the other hand, larger batches require more GPU RAM. A large batch size can cause out-of-memory errors because the inputs to each layer are retained in memory, especially during training, when they are needed for the back-propagation step.
Convergence - If you train your model with stochastic gradient descent (SGD) or one of its variants, you should be aware that the batch size might have an impact on how well your network converges and generalizes. In many computer vision problems, batch sizes typically range from 32 to 512 instances.
Corrupting the weights - This irritating technical detail can have disastrous effects. When performing multi-GPU training, it’s crucial to provide data to every GPU. The final batch of an epoch may contain fewer samples than expected (because the size of the dataset is not exactly divisible by the batch size).
Some GPUs may not get any data during the final step as a result. Sadly, some Keras layers, most notably the Batch Normalization layer, can’t handle that, which causes NaN values to appear in the weights (the running mean and variance in the BN layer).
To make matters worse, because the layer uses the batch’s own mean/variance in its estimations during training (when the learning phase is 1), one will not notice the issue then. However, the running mean/variance is used during predictions (learning phase set to 0), and in our scenario it can become NaN, leading to poor results.
Therefore, while performing multi-GPU training, we should always make sure that the batch size is fixed. Two straightforward ways to accomplish this are rejecting batches that don’t match the predefined size or repeating entries in the batch until it does. Last but not least, remember that in a multi-GPU configuration, the batch size should be larger than the total number of GPUs on your system.
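Both approaches to keeping the batch size fixed can be sketched in a few lines of plain Python. The function name and the incomplete argument are illustrative only. In a real pipeline the data loader usually handles this; for example, PyTorch's DataLoader accepts drop_last=True to discard a short final batch.

```python
def fixed_size_batches(samples, batch_size, incomplete="drop"):
    """Split `samples` into batches that all contain exactly `batch_size` items.

    incomplete="drop"   discards a short final batch.
    incomplete="repeat" pads a short final batch by cycling through its entries.
    """
    if incomplete not in ("drop", "repeat"):
        raise ValueError("incomplete must be 'drop' or 'repeat'")
    batches = [samples[i:i + batch_size] for i in range(0, len(samples), batch_size)]
    if batches and len(batches[-1]) < batch_size:
        if incomplete == "drop":
            batches.pop()
        else:
            short = batches[-1]
            batches[-1] = [short[i % len(short)] for i in range(batch_size)]
    return batches


data = list(range(10))
print(fixed_size_batches(data, 4, "drop"))    # [[0, 1, 2, 3], [4, 5, 6, 7]]
print(fixed_size_batches(data, 4, "repeat"))  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 8, 9]]
```

Dropping loses a few samples per epoch, while repeating shows a few samples twice. With shuffling enabled, the affected samples differ from epoch to epoch, so either choice is usually acceptable.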
In this article, we saw how to use various tools to maximize GPU utilization by finding the right batch size. As long as you set a respectable batch size (16+) and keep the iterations and epochs the same, the batch size has little impact on model performance. Training time will be affected, though. For multi-GPU training, we should select the smallest batch size that still lets each GPU train at full capacity; 16 per GPU is a good number.