Distributed machine learning is complicated, and when combined with deep learning models that are themselves complex, just getting anything to work can turn into a research project. Add in setting up your GPU hardware and software, and it can become too much to take on.
Here we show that Hugging Face’s Accelerate library removes some of the burden of using a distributed setup, while still allowing you to retain all of your original PyTorch code.
To follow along with this article, you will need experience with Python code and a beginner's understanding of deep learning. We will operate under the assumption that all readers have access to sufficiently powerful machines, so they can run the code provided.
If you do not have access to a multi-GPU system, we suggest accessing one through the cloud. There are many cloud providers that offer GPUs. DigitalOcean GPU Droplets are currently in Early Availability; learn more and sign up for interest in GPU Droplets here.
For instructions on getting started with Python code, we recommend trying this beginner's guide to set up your system and prepare to run beginner tutorials.
Accelerate is a library from Hugging Face that simplifies turning PyTorch code for a single GPU into code for multiple GPUs, on single or multiple machines. You can read more about Accelerate on their GitHub repository here.
With deep learning at the cutting edge, we may not always be able to avoid complexity in the real data or models, but we can reduce the difficulty of running them on GPUs, and on more than one GPU at once.
Several libraries exist to do this, but often they either provide higher-level abstractions that remove fine-grained control from the user, or provide another API that has to be learned before it can be used.
This was the motivation behind Accelerate: to allow users who need to write fully general PyTorch code to do so, while reducing the burden of running that code in a distributed manner.
Another key capability of the library is that the same code can be run either distributed or not. This is different from the traditional PyTorch distributed launch, where the code has to be changed to go from one to the other, and back again.
If you need to use fully general PyTorch code, it is likely that you are writing your own training loop for the model.
A typical PyTorch training loop goes something like this:
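Below is a minimal, self-contained sketch of such a loop, with toy data and a toy linear model standing in for a real dataset and network (this is illustrative only, not code from the Accelerate examples):

# A minimal single-GPU PyTorch training loop on toy data (illustrative sketch).
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Toy dataset and model, standing in for real data and a real network.
dataset = TensorDataset(torch.randn(256, 16), torch.randint(0, 2, (256,)))
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)
model = nn.Linear(16, 2).to(device)                    # move the model to the device
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(3):
    for inputs, targets in dataloader:
        inputs, targets = inputs.to(device), targets.to(device)   # move each batch
        optimizer.zero_grad()
        loss = criterion(model(inputs), targets)
        loss.backward()                                # compute gradients
        optimizer.step()                               # update the weights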
There may be other steps too, like data preparation, or running the model on test data, depending on the problem being solved.
The README for the Accelerate GitHub repository illustrates the code changes needed, compared to regular PyTorch, for a training loop like the one above, by highlighting the lines to be changed:
Code changes for a training loop using Accelerate versus original PyTorch. (From the Accelerate GitHub repository README)
Green means new lines that are added, and red means lines that are removed. We can see how the code corresponds to the training loop steps outlined above, and the changes needed.
At first glance, the changes don’t appear to simplify the code much, but if you imagine the red lines are gone, you can see that we are no longer talking about what device we are on (CPU, GPU, etc.). It has been abstracted away, while leaving the rest of the loop intact.
In more detail, the code changes are: import and instantiate an Accelerator, remove the manual .to(device) calls for the model and the data, pass the model, optimizer, and dataloader through accelerator.prepare(), and replace loss.backward() with accelerator.backward(loss).
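Applied to the toy loop sketched earlier, the Accelerate version looks roughly like this (again an illustrative sketch with toy data, not the README's exact snippet):

# The same toy loop, adapted for Accelerate: no manual device handling,
# objects wrapped by prepare(), and backward() called via the accelerator.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator()                            # handles device placement for us

dataset = TensorDataset(torch.randn(256, 16), torch.randint(0, 2, (256,)))
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)
model = nn.Linear(16, 2)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Wrap the objects so they work on whatever device(s) Accelerate is configured for.
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for epoch in range(3):
    for inputs, targets in dataloader:                 # batches arrive on the right device
        optimizer.zero_grad()
        loss = criterion(model(inputs), targets)
        accelerator.backward(loss)                     # replaces loss.backward()
        optimizer.step()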
The original, unmodified PyTorch loop above is written for a single GPU.
On their Hugging Face blog entry, the Accelerate authors then show how PyTorch code needs to be changed to enable multi-GPU using the traditional method.
It includes many more lines of code:
import os
...
from torch.utils.data import DistributedSampler
from torch.nn.parallel import DistributedDataParallel
local_rank = int(os.environ.get("LOCAL_RANK", -1))   # rank of this process on the machine
...
device = torch.device("cuda", local_rank)             # bind the process to its own GPU
...
model = DistributedDataParallel(model)                # wrap the model for synchronized training
...
sampler = DistributedSampler(dataset)                 # give each process its own shard of the data
...
data = torch.utils.data.DataLoader(dataset, sampler=sampler)
...
sampler.set_epoch(epoch)                              # reshuffle the shards each epoch
...
The resulting code then no longer works for a single GPU.
In contrast, code using Accelerate already works for multi-GPU, and continues to work for a single GPU as well.
The Accelerate GitHub repository shows how to run the library via a well-documented set of examples.
To run them, start a Jupyter Notebook in the usual way.
Let's see the first example.
Hugging Face was founded on making Natural Language Processing (NLP) easier to access for people, so NLP is an appropriate place to start.
Then there are some short setup steps:
pip install accelerate
pip install datasets transformers
pip install scipy sklearn
and we can proceed to the example:
cd examples
python ./nlp_example.py
This performs fine-tuning on the well-known BERT transformer model in its base configuration, using the GLUE MRPC dataset, which is concerned with whether or not one sentence is a paraphrase of another.
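To give a sense of what the script is doing, the setup is roughly along these lines (a sketch only, not the actual nlp_example.py; the bert-base-cased checkpoint name is an assumption based on the description above):

# Rough sketch of the kind of setup nlp_example.py performs (not the actual script).
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification

raw_datasets = load_dataset("glue", "mrpc")                    # sentence pairs + paraphrase labels
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")   # assumed checkpoint name
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=2)
# The tokenized dataloaders, optimizer, and model are then passed through
# accelerator.prepare() and trained with a loop like the one shown earlier.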
It outputs an accuracy of about 85% and an F1 score (the harmonic mean of precision and recall) of just below 90%, so the performance is decent.
For multi-GPU, the simplifying power of Accelerate really starts to show, because the same code as above can be run unchanged.
Then, to invoke the script for multi-GPU, first install the dependencies as before:
pip install accelerate datasets transformers scipy sklearn
and run through Accelerate's brief configuration steps to tell it how the code is to be run here:
accelerate config
In which compute environment are you running? ([0] This machine, [1] AWS (Amazon SageMaker)): 0
Which type of machine are you using? ([0] No distributed training, [1] multi-CPU, [2] multi-GPU, [3] TPU): 2
How many different machines will you use (use more than 1 for multi-node training)? [1]: 1
Do you want to use DeepSpeed? [yes/NO]: no
Do you want to use FullyShardedDataParallel? [yes/NO]: no
How many GPU(s) should be used for distributed training? [1]: 2
Do you wish to use FP16 or BF16 (mixed precision)? [NO/fp16/bf16]: no
Note that we say 1 machine, because our 2 GPUs are on the same machine, but we confirm that 2 GPUs are to be used.
Then we can run as before, now using the accelerate launch command instead of python to tell Accelerate to use the config that we just set:
accelerate launch ./nlp_example.py
You can see that both GPUs are being used by running nvidia-smi in the terminal.
As hinted at by the configuration file setup above, we have only scratched the surface of the library’s features.
Some other features that it has include the ability to launch your code from an .ipynb Jupyter notebook (a sketch of this is shown below), as well as the DeepSpeed, FullyShardedDataParallel, and mixed-precision options that appeared in the configuration prompts above.
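For example, launching from inside a notebook cell goes through Accelerate's notebook_launcher utility; here is a minimal sketch, where training_loop is a placeholder for your own Accelerate-based training function:

# Sketch of launching distributed training from a notebook cell.
from accelerate import notebook_launcher

def training_loop():
    # Placeholder: build the Accelerator, model, and dataloaders, and train here.
    pass

# Spawns one process per GPU and runs training_loop in each of them.
notebook_launcher(training_loop, args=(), num_processes=2)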
There is also another machine learning example that you can run; it's similar to the NLP task that we have been running here, but for computer vision. It trains a ResNet50 network on the Oxford-IIIT Pet Dataset.
In your notebook, you can add the following to a code cell to quickly run the example:
pip install accelerate datasets transformers scipy sklearn
pip install timm torchvision
cd examples
wget https://www.robots.ox.ac.uk/~vgg/data/pets/data/images.tar.gz
tar -xzf images.tar.gz
python ./cv_example.py --data_dir images
We have shown how the Accelerate library from Hugging Face simplifies running PyTorch deep learning models in a distributed manner compared to traditional PyTorch, without removing the fully general nature of the user’s code.
For some next steps, try the computer vision example above, explore the rest of the examples in the Accelerate GitHub repository, or read the Hugging Face blog entry on Accelerate to see the features that we have not covered here.