Tutorial

Multi-GPU on raw PyTorch with Hugging Face’s Accelerate library

Updated on September 23, 2024

Nick Ball


Distributed machine learning is complicated, and when combined with deep learning models that are also complex, it can make just getting anything to work into a research project. Add in setting up your GPU hardware and software, and it may become too much to take on.

Here we show that Hugging Face’s Accelerate library removes some of the burden of using a distributed setup, while still allowing you to retain all of your original PyTorch code.

Prerequisites

In order to follow along with this article, you will need experience with Python code and a beginner's understanding of deep learning. We will operate under the assumption that all readers have access to sufficiently powerful machines, so they can run the code provided.

If you do not have access to a multi-GPU system, we suggest accessing one through the cloud. There are many cloud providers that offer GPUs. DigitalOcean GPU Droplets are currently in Early Availability; learn more and sign up for interest in GPU Droplets here.

For instructions on getting started with Python code, we recommend trying this beginner's guide to set up your system and prepare to run beginner tutorials.

What is Accelerate?


Accelerate is a library from Hugging Face that simplifies turning PyTorch code for a single GPU into code for multiple GPUs, on single or multiple machines. You can read more about Accelerate on their GitHub repository here.

Motivation

With state-of-the-art deep learning at the cutting edge, we may not always be able to avoid complexity in the real data or models, but we can reduce the difficulty in running them on GPUs, and on more than one GPU at once.

Several libraries exist to do this, but often they either provide higher-level abstractions that remove fine-grained control from the user, or provide another API interface that needs to be learned first before it can be used.

This is the motivation behind Accelerate: to let users who need to write fully general PyTorch code do so, while reducing the burden of running that code in a distributed manner.

Another key capability provided by the library is that a fixed form of code can be run either distributed or not. This is different from the traditional PyTorch distributed launch that has to be changed to go from one to the other, and back again.

Code changes to use Accelerate

If you need to use fully general PyTorch code, it is likely that you are writing your own training loop for the model.

Training Loop

A typical PyTorch training loop goes something like this:

  • Import libraries
  • Set device (e.g., GPU)
  • Point model to device
  • Choose optimizer (e.g., Adam)
  • Load dataset using DataLoader (so we can pass batches to the model)
  • Train model in a loop (one pass over the data per epoch):
    • Point source data and targets to device
    • Zero the network gradients
    • Calculate output from model
    • Calculate loss (e.g., cross-entropy)
    • Backpropagate the gradient
    • Step the optimizer to update the weights

There may be other steps too, like data preparation, or running the model on test data, depending on the problem being solved.
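As a concrete reference point, here is a minimal sketch of such a loop in plain PyTorch. The model, data, and hyperparameters are placeholders for illustration only, not the ones used in the examples later in this tutorial.

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset and model, purely for illustration
dataset = TensorDataset(torch.randn(256, 32), torch.randint(0, 2, (256,)))
model = nn.Linear(32, 2)

# Set device and point the model to it
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Choose optimizer and load the dataset with a DataLoader
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
data = DataLoader(dataset, batch_size=32, shuffle=True)

for epoch in range(3):
    for source, targets in data:
        # Point source data and targets to device
        source, targets = source.to(device), targets.to(device)
        optimizer.zero_grad()                    # zero the network gradients
        output = model(source)                   # calculate output from model
        loss = F.cross_entropy(output, targets)  # calculate loss
        loss.backward()                          # backpropagate the gradient
        optimizer.step()                         # update the weights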

Code Changes

The README of the Accelerate GitHub repository illustrates the code changes needed for a training loop like the one above, compared to regular PyTorch, by highlighting the lines to be changed:


Code changes for a training loop using Accelerate versus original PyTorch. (From the Accelerate GitHub repository README)

Green means new lines that are added, and red means lines that are removed. We can see how the code corresponds to the training loop steps outlined above, and the changes needed.

At first glance, the changes don’t appear to simplify the code much, but if you imagine the red lines are gone, you can see that we are no longer talking about what device we are on (CPU, GPU, etc.). It has been abstracted away, while leaving the rest of the loop intact.

In more detail, the code changes are as follows (a short sketch applying them appears after this list):

  • Import the Accelerator library
  • Use the accelerator as the device, which can be CPU or GPU
  • Instantiate the model, without having to specify a device
  • Set up the model, optimizer, and data to be used by Accelerate
  • We don’t need to point source data and targets to device
  • Accelerator does the backpropagation step
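
Applied to the sketch above, the loop with Accelerate looks roughly like the following. This is again a minimal sketch with placeholder model and data rather than the repository's exact example code; the Accelerator, prepare, and backward calls are the core of the library's API.

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

# Placeholder dataset and model, purely for illustration
dataset = TensorDataset(torch.randn(256, 32), torch.randint(0, 2, (256,)))
model = nn.Linear(32, 2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
data = DataLoader(dataset, batch_size=32, shuffle=True)

accelerator = Accelerator()  # decides the device(s) and distribution strategy for us
model, optimizer, data = accelerator.prepare(model, optimizer, data)

for epoch in range(3):
    for source, targets in data:  # batches are already placed on the right device
        optimizer.zero_grad()
        output = model(source)
        loss = F.cross_entropy(output, targets)
        accelerator.backward(loss)  # replaces loss.backward()
        optimizer.step()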

Single-GPU

The code above is for a single GPU.

In their Hugging Face blog entry, the Accelerate authors then show how PyTorch code needs to be changed to enable multi-GPU training using the traditional method.

It includes many more lines of code:

import os
...
from torch.utils.data import DistributedSampler
from torch.nn.parallel import DistributedDataParallel
local_rank = int(os.environ.get("LOCAL_RANK", -1))  # rank of this process on the machine
...
device = torch.device("cuda", local_rank)            # bind each process to its own GPU
...
model = DistributedDataParallel(model)               # wrap the model for distributed training
...
sampler = DistributedSampler(dataset)                # shard the data across processes
...
data = torch.utils.data.DataLoader(dataset, sampler=sampler)
...
    sampler.set_epoch(epoch)                         # reshuffle consistently each epoch
    ...

The resulting code then no longer works for a single GPU.

In contrast, code using Accelerate already works for multi-GPU, and continues to work for a single GPU as well.

Running Accelerate

The Accelerate GitHub repository shows how to run the library via a well-documented set of examples.

To run, start a Jupyter Notebook in the usual way.

Let’s see the example.

Simple NLP example

Hugging Face was founded on making Natural Language Processing (NLP) easier to access for people, so NLP is an appropriate place to start.
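
If you are starting from a fresh environment, you will first need the examples directory, which comes with the Accelerate source repository. Cloning it from its standard GitHub location should be enough:

git clone https://github.com/huggingface/accelerate
cd accelerate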

Then there are some short setup steps:

pip install accelerate
pip install datasets transformers
pip install scipy scikit-learn

and we can proceed to the example:

cd examples
python ./nlp_example.py

This performs fine-tuning training on the well-known BERT transformer model in its base configuration, using the GLUE MRPC dataset, in which the task is to decide whether or not one sentence is a paraphrase of another.

It reports an accuracy of about 85% and an F1 score (the harmonic mean of precision and recall) of just below 90%, so the performance is decent.
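
To give a feel for what the script is doing, the core ingredients look roughly like this. This is a simplified sketch rather than a copy of nlp_example.py, with bert-base-cased assumed as the checkpoint; the training loop itself follows the Accelerate pattern shown earlier.

from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load the GLUE MRPC paraphrase dataset and a pretrained BERT checkpoint
raw_datasets = load_dataset("glue", "mrpc")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased")

# Tokenize the sentence pairs so they can be batched and passed to the model
def tokenize(examples):
    return tokenizer(examples["sentence1"], examples["sentence2"],
                     truncation=True, max_length=128)

tokenized = raw_datasets.map(tokenize, batched=True,
                             remove_columns=["idx", "sentence1", "sentence2"])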

Multi-GPU

For multi-GPU, the simplifying power of Accelerate really starts to show, because the same code as above can be run unchanged.

To invoke the script on multiple GPUs, first install the dependencies as before:

pip install accelerate datasets transformers scipy scikit-learn

and then run through Accelerate's brief configuration steps to tell it how it should be run here:

accelerate config
In which compute environment are you running? ([0] This machine, [1] AWS (Amazon SageMaker)): 0

Which type of machine are you using? ([0] No distributed training, [1] multi-CPU, [2] multi-GPU, [3] TPU): 2

How many different machines will you use (use more than 1 for multi-node training)? [1]: 1

Do you want to use DeepSpeed? [yes/NO]: no

Do you want to use FullyShardedDataParallel? [yes/NO]: no

How many GPU(s) should be used for distributed training? [1]: 2

Do you wish to use FP16 or BF16 (mixed precision)? [NO/fp16/bf16]: no

Note that we say 1 machine, because our 2 GPUs are on the same machine, but we confirm that 2 GPUs are to be used.
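
The answers are saved to a configuration file so that later launches can reuse them. By default this is a YAML file under the Hugging Face cache directory; the path below is the usual default, though it can differ depending on your version and environment settings:

cat ~/.cache/huggingface/accelerate/default_config.yaml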

Then we can run as before, now using the launch command instead of python to tell Accelerate to use the config that we just set:

accelerate launch ./nlp_example.py

You can see that both GPUs are being used by running nvidia-smi in the terminal.
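
If you would rather not rely on a saved configuration, accelerate launch can also take the equivalent settings as command-line flags. For this two-GPU case, something like the following should behave the same way (check accelerate launch --help on your version for the exact flags):

accelerate launch --multi_gpu --num_processes 2 ./nlp_example.py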

More features

As hinted at by the configuration file setup above, we have only scratched the surface of the library’s features.

Some other features that it has include:

  • Multi-CPU and TPU training, in addition to multi-GPU
  • Multi-node training across more than one machine
  • DeepSpeed integration
  • Fully sharded data parallel (FullyShardedDataParallel) training
  • Mixed precision (FP16/BF16) training

Computer vision example

There is also another machine learning example that you can run; it's similar to the NLP task that we have been running here, but for computer vision. It trains a ResNet50 network on the Oxford-IIIT Pet Dataset.

On our Notebook, you can add the following to a code cell to quickly run the example:

pip install accelerate datasets transformers scipy scikit-learn
pip install timm torchvision
cd examples
wget https://www.robots.ox.ac.uk/~vgg/data/pets/data/images.tar.gz   # download the Oxford-IIIT Pet images
tar -xzf images.tar.gz
python ./cv_example.py --data_dir images

Conclusions and next steps

We have shown how the Accelerate library from Hugging Face simplifies running PyTorch deep learning models in a distributed manner compared to traditional PyTorch, without removing the fully general nature of the user’s code.

For some next steps, try the other examples in the Accelerate GitHub repository and explore its documentation to dig further into the features mentioned above.
