The author selected Code.org to receive a donation as part of the Write for DOnations program.
Computer vision is a subfield of computer science that aims to extract a higher-order understanding from images and videos. This powers technologies such as fun video chat filters, your mobile device’s face authenticator, and self-driving cars.
In this tutorial, you’ll use computer vision to build an American Sign Language translator for your webcam. As you work through the tutorial, you’ll use OpenCV, a computer-vision library; PyTorch, to build a deep neural network; and onnx, to export your neural network. You’ll also apply the following concepts as you build a computer-vision application:
By the end of this tutorial, you’ll have both an American Sign Language translator and foundational deep learning know-how. You can also access the complete source code for this project.
To complete this tutorial, you will need the following:
Let’s create a workspace for this project and install the dependencies we’ll need.
On Linux distributions, start by preparing your system package manager and installing the Python3 virtualenv package:
We’ll call our workspace SignLanguage:
Navigate to the SignLanguage directory:
Then create a new virtual environment for the project:
Activate your environment:
Then install PyTorch, a deep-learning framework for Python that we’ll use in this tutorial.
On macOS, install PyTorch with the following command:
On Linux and Windows, use the following commands for a CPU-only build:
Now install prepackaged binaries for OpenCV, numpy, onnx, and the ONNX runtime, which are libraries for computer vision, linear algebra, AI model exporting, and AI model execution, respectively. OpenCV offers utilities such as image rotations, and numpy offers linear algebra utilities such as matrix inversion:
On Linux distributions, you will need to install libSM.so:
With the dependencies installed, let’s build the first version of our sign language translator: a sign language classifier.
In these next three sections, you’ll build a sign language classifier using a neural network. Your goal is to produce a model that accepts a picture of a hand as input and outputs a letter.
The following three steps are required to build a machine learning classification model:
In this section of the tutorial, you will accomplish step 1 of 3. You will download the data, create a Dataset object to iterate over your data, and finally apply data augmentation. At the end of this step, you will have a programmatic way of accessing images and labels in your dataset to feed to your model.
First, download the dataset to your current working directory:
Note: On macOS, wget is not available by default. To install it, first install Homebrew by following this DigitalOcean tutorial, then run brew install wget.
Unzip the zip file, which contains a data/ directory:
Create a new file, named step_2_dataset.py:
As before, import the necessary utilities and create the class that will hold your data. For data processing here, you will create the train and test datasets. You’ll implement PyTorch’s Dataset interface, allowing you to load and use PyTorch’s built-in data pipeline for your sign language classification dataset:
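(A minimal sketch of the file’s starting point; the exact import list is an assumption.)

```python
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms
import torch
import numpy as np
import csv


class SignLanguageMNIST(Dataset):
    """Sign language dataset: 1x28x28 grayscale images with scalar labels."""
    pass
```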
Delete the pass placeholder in the SignLanguageMNIST class. In its place, add a method to generate a label mapping:
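(A sketch of such a method.)

```python
    @staticmethod
    def get_label_mapping():
        """Map dataset labels [0, 23] to letter indices [0, 25], skipping J and Z."""
        mapping = list(range(25))
        mapping.pop(9)  # drop J (9); Z (25) is never generated
        return mapping
```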
Labels range from 0 to 25. However, the letters J (9) and Z (25) are excluded because they involve motion rather than a static hand shape, so there are only 24 valid label values. To make the set of label values contiguous and starting from 0, we map all labels to [0, 23]. This mapping from dataset labels [0, 23] to letter indices [0, 25] is provided by the get_label_mapping method.
Next, add a method to extract labels and samples from a CSV file. The following assumes that each line starts with the label and is then followed by 784 pixel values. These 784 pixel values represent a 28x28 image:
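(A sketch, under the assumption that the CSV begins with a header row.)

```python
    @staticmethod
    def read_label_samples_from_csv(path: str):
        """Assumes the first column is the label, followed by 784 pixel values."""
        mapping = SignLanguageMNIST.get_label_mapping()
        labels, samples = [], []
        with open(path) as f:
            _ = next(f)  # skip the header row
            for line in csv.reader(f):
                label = int(line[0])
                labels.append(mapping.index(label))
                samples.append(list(map(int, line[1:])))
        return labels, samples
```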
For an explanation of how these 784 values represent an image, see Build an Emotion-Based Dog Filter, Step 4.
Note that each line in the csv.reader iterable is a list of strings; the int and map(int, ...) invocations cast all strings to integers. Directly beneath our static method, add a function that will initialize our data holder:
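(A sketch; the default CSV path and the mean and standard deviation values are assumptions.)

```python
    def __init__(self,
                 path: str = "data/sign_mnist_train.csv",
                 mean: list = [0.485],
                 std: list = [0.229]):
        labels, samples = SignLanguageMNIST.read_label_samples_from_csv(path)
        self._samples = np.array(samples, dtype=np.uint8).reshape((-1, 28, 28, 1))
        self._labels = np.array(labels, dtype=np.uint8).reshape((-1, 1))

        self._mean = mean
        self._std = std
```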
This function starts by loading the samples and labels, then wraps the data in NumPy arrays. The mean and standard deviation will be explained shortly, in the discussion of __getitem__ below.
Directly after the __init__ function, add a __len__ function. The Dataset requires this method to determine when to stop iterating over data:
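(A sketch of the method.)

```python
    def __len__(self):
        return len(self._labels)
```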
Finally, add a __getitem__ method, which returns a dictionary containing the sample and the label:
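(A sketch; the random crop scale range is an assumption.)

```python
    def __getitem__(self, idx):
        transform = transforms.Compose([
            transforms.ToPILImage(),
            transforms.RandomResizedCrop(28, scale=(0.8, 1.0)),  # random zoom for augmentation
            transforms.ToTensor(),
            transforms.Normalize(mean=self._mean, std=self._std)])  # rescale using dataset statistics

        return {
            'image': transform(self._samples[idx]).float(),
            'label': torch.from_numpy(self._labels[idx]).float()
        }
```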
You use a technique called data augmentation, where samples are perturbed during training, to increase the model’s robustness to those perturbations. In particular, you randomly zoom in on the image by varying amounts and at different locations via RandomResizedCrop. Note that zooming in should not affect the final sign language class, so the label is not transformed. You additionally normalize the inputs so that image values are rescaled to the [0, 1] range in expectation, instead of [0, 255]; to accomplish this, use the dataset _mean and _std when normalizing.
Your completed SignLanguageMNIST class assembles all of the pieces above: the two static helpers followed by the __init__, __len__, and __getitem__ methods.
As before, you will now verify our dataset utility functions by loading the SignLanguageMNIST dataset. Add the following code to the end of your file after the SignLanguageMNIST class:
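(A sketch; the helper name get_train_test_loaders and the CSV filenames are assumptions.)

```python
def get_train_test_loaders(batch_size=32):
    trainset = SignLanguageMNIST('data/sign_mnist_train.csv')
    trainloader = DataLoader(trainset, batch_size=batch_size, shuffle=True)

    testset = SignLanguageMNIST('data/sign_mnist_test.csv')
    testloader = DataLoader(testset, batch_size=batch_size, shuffle=False)
    return trainloader, testloader
```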
This code initializes the dataset using the SignLanguageMNIST class. Then, for the train and validation sets, it wraps the dataset in a DataLoader. This will translate the dataset into an iterable to use later.
Now you’ll verify that the dataset utilities are functioning. Create a sample dataset loader using DataLoader and print the first element of that loader. Add the following to the end of your file:
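(A sketch that requests a batch size of two, matching the output shown below.)

```python
if __name__ == '__main__':
    loader, _ = get_train_test_loaders(2)
    print(next(iter(loader)))
```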
You can check that your file matches the step_2_dataset file in this repository. Exit your editor and run the script with the following:
This outputs the following pair of tensors. Our data pipeline outputs two samples and two labels. This indicates that our data pipeline is up and ready to go:
Output{'image': tensor([[[[ 0.4337, 0.5022, 0.5707, ..., 0.9988, 0.9646, 0.9646],
[ 0.4851, 0.5536, 0.6049, ..., 1.0502, 1.0159, 0.9988],
[ 0.5364, 0.6049, 0.6392, ..., 1.0844, 1.0844, 1.0673],
...,
[-0.5253, -0.4739, -0.4054, ..., 0.9474, 1.2557, 1.2385],
[-0.3369, -0.3369, -0.3369, ..., 0.0569, 1.3584, 1.3242],
[-0.3712, -0.3369, -0.3198, ..., 0.5364, 0.5364, 1.4783]]],
[[[ 0.2111, 0.2796, 0.3481, ..., 0.2453, -0.1314, -0.2342],
[ 0.2624, 0.3309, 0.3652, ..., -0.3883, -0.0629, -0.4568],
[ 0.3309, 0.3823, 0.4337, ..., -0.4054, -0.0458, -1.0048],
...,
[ 1.3242, 1.3584, 1.3927, ..., -0.4054, -0.4568, 0.0227],
[ 1.3242, 1.3927, 1.4612, ..., -0.1657, -0.6281, -0.0287],
[ 1.3242, 1.3927, 1.4440, ..., -0.4397, -0.6452, -0.2856]]]]), 'label': tensor([[24.],
[11.]])}
You’ve now verified that your data pipeline works. This concludes the first step—preprocessing your data—which now includes data augmentation for increased model robustness. Next you will define the neural network and optimizer.
With a functioning data pipeline, you will now define a model and train it on the data. In particular, you will build a neural network with six layers, define a loss, an optimizer, and finally, optimize the loss function for your neural network predictions. At the end of this step, you will have a working sign language classifier.
Create a new file called step_3_train.py:
Import the necessary utilities:
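(A sketch; the exact import list, and the helper name imported from step 2, are assumptions.)

```python
from torch.utils.data import DataLoader
from torch.autograd import Variable
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torch

from step_2_dataset import get_train_test_loaders
```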
Define a PyTorch neural network that includes three convolutional layers, followed by three fully connected layers. Add this to the end of your existing script:
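(A sketch of one such six-layer network; the channel and hidden-layer sizes are assumptions, chosen so the shapes work out for 28x28 inputs and 24 output classes.)

```python
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 6, 3)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 6, 3)
        self.conv3 = nn.Conv2d(6, 16, 3)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 48)
        self.fc3 = nn.Linear(48, 24)  # one output per valid letter

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = self.pool(F.relu(self.conv2(x)))
        x = self.pool(F.relu(self.conv3(x)))
        x = x.view(-1, 16 * 5 * 5)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return self.fc3(x)
```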
Now initialize the neural network, define a loss function, and define optimization hyperparameters by adding the following code to the end of the script:
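(A sketch of the start of main; the learning rate and momentum values are assumptions.)

```python
def main():
    net = Net().float()
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(net.parameters(), lr=0.01, momentum=0.9)

    trainloader, _ = get_train_test_loaders()
```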
Finally, you’ll train for two epochs:
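(Continuing inside main; a sketch of the two-epoch loop and the checkpoint save. The train helper is defined next.)

```python
    for epoch in range(2):  # loop over the dataset twice
        train(net, criterion, optimizer, trainloader, epoch)
    torch.save(net.state_dict(), "checkpoint.pth")
```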
You define an epoch to be an iteration of training in which every training sample has been used exactly once. At the end of the main function, the model parameters will be saved to a file called "checkpoint.pth".
Add the following code to the end of your script to extract image and label from the dataset loader and then wrap each in a PyTorch Variable:
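(A sketch of such a training loop; the logging interval is an assumption.)

```python
def train(net, criterion, optimizer, trainloader, epoch):
    running_loss = 0.0
    for i, data in enumerate(trainloader, 0):
        inputs = Variable(data['image'].float())
        labels = Variable(data['label'].long())
        optimizer.zero_grad()

        # forward pass, then backpropagate through the loss and the network
        outputs = net(inputs)
        loss = criterion(outputs, labels[:, 0])
        loss.backward()
        optimizer.step()

        # print running statistics every 100 batches
        running_loss += loss.item()
        if i % 100 == 0:
            print('[%d, %d] loss: %.6f' % (epoch, i, running_loss / (i + 1)))
```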
This code will also run the forward pass and then backpropagate through the loss and neural network.
At the end of your file, add the following to invoke the main function:
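(The standard entry-point guard.)

```python
if __name__ == '__main__':
    main()
```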
Double-check that your file contains all of the pieces above, assembled in order.
Save and exit. Then, launch our proof-of-concept training by running:
You’ll see output akin to the following as the neural network trains:
Output[0, 0] loss: 3.208171
[0, 100] loss: 3.211070
[0, 200] loss: 3.192235
[0, 300] loss: 2.943867
[0, 400] loss: 2.569440
[0, 500] loss: 2.243283
[0, 600] loss: 1.986425
[0, 700] loss: 1.768090
[0, 800] loss: 1.587308
[1, 0] loss: 0.254097
[1, 100] loss: 0.208116
[1, 200] loss: 0.196270
[1, 300] loss: 0.183676
[1, 400] loss: 0.169824
[1, 500] loss: 0.157704
[1, 600] loss: 0.151408
[1, 700] loss: 0.136470
[1, 800] loss: 0.123326
To obtain a lower loss, you could increase the number of epochs to 5, 10, or even 20. However, after a certain amount of training, the loss ceases to decrease. To sidestep this issue, you will introduce a learning rate schedule, which decreases the learning rate over time. To understand why this works, see Distill’s visualization at “Why Momentum Really Works”.
Amend your main function with the following two lines, defining a scheduler and invoking scheduler.step. Furthermore, change the number of epochs to 12:
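(A sketch of the amended main; the StepLR step size and decay factor are assumptions.)

```python
def main():
    net = Net().float()
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(net.parameters(), lr=0.01, momentum=0.9)
    scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)  # decay learning rate over time

    trainloader, _ = get_train_test_loaders()
    for epoch in range(12):  # now train for 12 epochs
        train(net, criterion, optimizer, trainloader, epoch)
        scheduler.step()
    torch.save(net.state_dict(), "checkpoint.pth")
```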
Check that your file matches the step 3 file in this repository. Training will run for around 5 minutes. Your output will resemble the following:
Output[0, 0] loss: 3.208171
[0, 100] loss: 3.211070
[0, 200] loss: 3.192235
[0, 300] loss: 2.943867
[0, 400] loss: 2.569440
[0, 500] loss: 2.243283
[0, 600] loss: 1.986425
[0, 700] loss: 1.768090
[0, 800] loss: 1.587308
...
[11, 0] loss: 0.000302
[11, 100] loss: 0.007548
[11, 200] loss: 0.009005
[11, 300] loss: 0.008193
[11, 400] loss: 0.007694
[11, 500] loss: 0.008509
[11, 600] loss: 0.008039
[11, 700] loss: 0.007524
[11, 800] loss: 0.007608
The final loss obtained is 0.007608, which is nearly three orders of magnitude smaller than the starting loss of 3.20. This concludes the second step of our workflow, in which we set up and trained the neural network. With that said, as small as this loss value is, it has little meaning on its own. To put the model’s performance in perspective, we will compute its accuracy: the percentage of images the model correctly classified.
You will now evaluate your sign language classifier by computing its accuracy on the validation set, a set of images the model did not see during training. This will provide a better sense of model performance than the final loss value did. Furthermore, you will add utilities to save our trained model at the end of training and load our pre-trained model when performing inference.
Create a new file called step_4_evaluate.py.
Import the necessary utilities:
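(A sketch; the exact import list and the names imported from earlier steps are assumptions.)

```python
from torch.utils.data import DataLoader
from torch.autograd import Variable
import torch.nn as nn
import torch
import numpy as np

import onnx
import onnxruntime as ort

from step_2_dataset import get_train_test_loaders
from step_3_train import Net
```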
Next, define a utility to evaluate the neural network’s performance. The following function compares the neural network’s predicted letter to the true letter, for a single image:
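(A sketch of such a function.)

```python
def evaluate(outputs, labels) -> float:
    """Count how many predictions match the (non-one-hot) labels."""
    Y = labels.numpy()
    Yhat = np.argmax(outputs, axis=1)
    return float(np.sum(Yhat == Y))
```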
outputs is a list of class probabilities for each sample. For example, outputs for a single sample may be [0.1, 0.3, 0.4, 0.2]. labels is a list of label classes. For example, the label class may be 3.
Y = ... converts the labels into a NumPy array. Next, Yhat = np.argmax(...) converts the outputs class probabilities into predicted classes. For example, the list of class probabilities [0.1, 0.3, 0.4, 0.2] would yield the predicted class 2, because the value at index 2, 0.4, is the largest.
Since both Y and Yhat are now classes, you can compare them. Yhat == Y checks whether the predicted class matches the label class, and np.sum(...) is a trick that counts the number of truthy values. In other words, np.sum outputs the number of samples that were classified correctly.
Add the second function, batch_evaluate, which applies the first function, evaluate, to all images:
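(A sketch of such a function.)

```python
def batch_evaluate(net, dataloader: DataLoader) -> float:
    """Evaluate the network batch by batch; return the fraction classified correctly."""
    score = n = 0.0
    for batch in dataloader:
        n += len(batch['image'])
        outputs = net(batch['image'])
        if isinstance(outputs, torch.Tensor):
            outputs = outputs.detach().numpy()  # convert to NumPy if needed
        score += evaluate(outputs, batch['label'][:, 0])
    return score / n
```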
batch is a group of images stored as a single tensor. First, you increment the total number of images you’re evaluating (n) by the number of images in this batch. Next, you run inference on the neural network with this batch of images: outputs = net(...). The type check if isinstance(...) converts the outputs to a NumPy array if needed. Finally, you use evaluate to compute the number of correctly classified samples. At the conclusion of the function, you compute the percentage of samples you classified correctly, score / n.
Finally, add the following script to leverage the preceding utilities:
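(A sketch; the checkpoint filename follows the earlier training step, and the printed headers mirror the output shown below.)

```python
def validate():
    trainloader, testloader = get_train_test_loaders()
    net = Net().float().eval()

    # load the parameters saved by the training script
    pretrained_model = torch.load("checkpoint.pth")
    net.load_state_dict(pretrained_model)

    print('=' * 10, 'PyTorch', '=' * 10)
    train_acc = 100. * batch_evaluate(net, trainloader)
    print('Training accuracy: %.1f' % train_acc)
    test_acc = 100. * batch_evaluate(net, testloader)
    print('Validation accuracy: %.1f' % test_acc)


if __name__ == '__main__':
    validate()
```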
This loads a pretrained neural network and evaluates its performance on the provided sign language dataset. Specifically, the script here outputs accuracy on the images you used for training and a separate set of images you put aside for testing purposes, called the validation set.
You will next export the PyTorch model to an ONNX binary. This binary file can then be used in production to run inference with your model. Most importantly, the code running this binary does not need a copy of the original network definition. At the end of the validate function, add the following:
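(A sketch of the export and re-evaluation; the ONNX filename, the input name, and the use of batch size 1 to match the traced dummy input are assumptions.)

```python
    trainloader, testloader = get_train_test_loaders(1)

    # export to ONNX by tracing the network with a dummy input
    fname = "signlanguage.onnx"
    dummy = torch.randn(1, 1, 28, 28)
    torch.onnx.export(net, dummy, fname, input_names=['input'])

    # check that the exported model is well-formed
    model = onnx.load(fname)
    onnx.checker.check_model(model)

    # create a runnable session with the exported model
    ort_session = ort.InferenceSession(fname)
    net = lambda inp: ort_session.run(None, {'input': inp.data.numpy()})[0]

    print('=' * 10, 'ONNX', '=' * 10)
    train_acc = 100. * batch_evaluate(net, trainloader)
    print('Training accuracy: %.1f' % train_acc)
    test_acc = 100. * batch_evaluate(net, testloader)
    print('Validation accuracy: %.1f' % test_acc)
```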
This exports the ONNX model, checks the exported model, and then runs inference with the exported model. Double-check that your file matches the step 4 file in this repository:
To use and evaluate the checkpoint from the last step, run the following:
This will yield output similar to the following, affirming that your exported model not only works, but also agrees with your original PyTorch model:
Output========== PyTorch ==========
Training accuracy: 99.9
Validation accuracy: 97.4
========== ONNX ==========
Training accuracy: 99.9
Validation accuracy: 97.4
Your neural network attains a train accuracy of 99.9% and a 97.4% validation accuracy. This gap between train and validation accuracy indicates your model is overfitting. This means that instead of learning generalizable patterns, your model has memorized the training data. To understand the implications and causes of overfitting, see Understanding Bias-Variance Tradeoffs.
At this point, we have completed a sign language classifier. In essence, our model can correctly disambiguate between signs almost all the time. This is a reasonably good model, so we can move on to the final stage of our application: using this sign language classifier in a real-time webcam application.
Your next objective is to link the computer’s camera to your sign language classifier. You will collect camera input, classify the displayed sign language, and then report the classified sign back to the user.
Now create a Python script for the sign language translator. Create the file step_6_camera.py using nano or your favorite text editor:
Add the following code into the file:
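(A sketch of the boilerplate.)

```python
"""Webcam demo for the sign language classifier."""
import cv2
import numpy as np
import onnxruntime as ort


def main():
    pass


if __name__ == '__main__':
    main()
```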
This code imports OpenCV, which contains your image utilities, and the ONNX runtime, which is all you need to run inference with your model. The rest of the code is typical Python program boilerplate.
Now replace pass in the main function with the following code, which initializes a sign language classifier using the parameters you trained previously. Additionally, add a mapping from indices to letters and image statistics:
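(A sketch; the letter mapping drops J and Z, the mean and standard deviation mirror the training-time statistics scaled to [0, 255], and the ONNX filename is assumed from the previous step.)

```python
def main():
    # constants
    index_to_letter = list('ABCDEFGHIKLMNOPQRSTUVWXY')  # 24 static letters; J and Z are excluded
    mean = 0.485 * 255.
    std = 0.229 * 255.

    # create a runnable session with the exported model
    ort_session = ort.InferenceSession("signlanguage.onnx")
```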
You will use elements of this test script from the official OpenCV documentation. Specifically, you will update the body of the main function. Start by initializing a VideoCapture object that is set to capture live feed from your computer’s camera. Place this at the end of the main function:
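(A sketch; device 0 is usually the built-in webcam.)

```python
    cap = cv2.VideoCapture(0)
```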
Then add a while loop, which reads from the camera at every timestep:
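(A sketch; the check on ret is a small defensive addition.)

```python
    while True:
        # capture frame-by-frame
        ret, frame = cap.read()
        if not ret:
            break
```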
Write a utility function that takes the center crop of the camera frame. Place this function before main:
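(A sketch of such a helper.)

```python
def center_crop(frame):
    """Crop the largest centered square from a (height, width, channels) frame."""
    h, w, _ = frame.shape
    start = abs(h - w) // 2
    if h > w:
        return frame[start: start + w]
    return frame[:, start: start + h]
```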
Next, take the center crop of the camera frame, convert it to grayscale, normalize it, and resize it to 28x28. Place this inside the while loop within the main function:
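(A sketch of the preprocessing, which mirrors the normalization used during training.)

```python
        # preprocess: center crop, grayscale, normalize, resize to 28x28
        frame = center_crop(frame)
        frame = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        x = cv2.resize(frame, (28, 28))
        x = (x - mean) / std
        x = x.reshape(1, 1, 28, 28).astype(np.float32)
```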
Still within the while loop, run inference with the ONNX runtime. Convert the outputs to a class index, then to a letter:
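(A sketch; the input name 'input' matches the name assumed at export time.)

```python
        # run inference, then map the most likely class index to a letter
        y = ort_session.run(None, {'input': x})[0]
        index = np.argmax(y, axis=1)
        letter = index_to_letter[int(index)]
```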
Display the predicted letter inside the frame, and show the frame back to the user:
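(A sketch; the text position, scale, and window title are assumptions.)

```python
        cv2.putText(frame, letter, (100, 100), cv2.FONT_HERSHEY_SIMPLEX,
                    2.0, (255, 255, 255), thickness=2)
        cv2.imshow("Sign Language Translator", frame)
```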
At the end of the while loop, add code that checks whether the user has hit the q key and, if so, quits the application. The cv2.waitKey(1) call halts the program for 1 millisecond while checking for a key press. Add the following:
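(A sketch of the quit check.)

```python
        if cv2.waitKey(1) & 0xFF == ord('q'):
            break
```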
Finally, release the capture and close all windows. Place this outside of the while loop to end the main function.
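(A sketch of the cleanup.)

```python
    # when everything is done, release the capture
    cap.release()
    cv2.destroyAllWindows()
```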
Double-check that your file matches the step 6 file in this repository.
Exit your file and run the script.
Once the script is run, a window will pop up with your live webcam feed. The predicted sign language letter will be shown in the top left. Hold up your hand and make your favorite sign to see your classifier in action. Here are some sample results showing the letters L and D.
While testing, note that the background needs to be fairly clear for this translator to work. This is an unfortunate consequence of the dataset’s cleanliness. Had the dataset included images of hand signs with miscellaneous backgrounds, the network would be robust to noisy backgrounds. However, the dataset features blank backgrounds and nicely centered hands. As a result, this webcam translator works best when your hand is likewise centered and placed against a blank background.
This concludes the sign language translator application.
In this tutorial, you built an American Sign Language translator using computer vision and a machine learning model. In particular, you saw new aspects of training a machine learning model: data augmentation for model robustness, learning rate schedules for lower loss, and exporting AI models using ONNX for production use. This culminated in a real-time computer vision application that translates sign language into letters using a pipeline you built. It’s worth noting that the brittleness of the final classifier can be tackled with any or all of the following methods. For further exploration, try the following topics to improve your application: