The author selected the IEEE Foundation to receive a donation as part of the Write for DOnations program.
Sentiment analysis, strongly related to text mining and natural language processing, extracts qualitative assessment from written reviews. Many people read movie reviews to assess how good a movie seems to be among the general population. While assigning a number or star rating to a film may not indicate its quantitative success or failure, a collection of film reviews offers a qualitative perspective on these films. A textual movie review can identify what viewers believe to be the film’s good and poor elements. A more in-depth examination of the review will often reveal if the film lives up to the reviewer’s expectations. Sentiment analysis can be used to assess the reviewer’s perspective on subjects or the overall polarity of the review.
To conduct sentiment analysis, you would run a computational program to recognize and categorize opinions in a piece of text, such as to discern whether the writer (or reviewer) has a positive or negative attitude towards a given topic (in this case, a film). As a sub-domain of opinion mining, sentiment analysis focuses on extracting emotions and opinions towards a particular topic from structured, semi-structured, or unstructured textual data. As with other opinion mining models, you might use sentiment analysis to monitor brand and product opinions and to understand customer needs. Sentiment analysis focuses on the polarity of a text (positive, negative, or neutral), as well as detecting specific feelings and emotions of the reviewer (angry, happy, sad, and so on as defined by the model), urgency, and even intentions (interested or not interested).
In this tutorial, you will build a neural network that predicts the sentiment of film reviews with keras. Your model will categorize the reviews into two categories (positive or negative) using the Internet Movie Database (IMDb) review dataset, which contains 50,000 movie reviews. By the end of this tutorial, you will have created a deep learning model and trained a neural network to perform sentiment analysis.
One Ubuntu 22.04 server instance with at least 8GB RAM. This server will need a non-root user with sudo privileges and a firewall configured, which you can set up by following our initial server setup guide.
Python 3, pip, and the Python venv module installed on the server, which you can set up by following Steps 1 and 2 of our tutorial on How To Install Python 3 and Set Up a Programming Environment.
Jupyter notebook installed and running on a remote server, which you can set up with How to Install, Run, and Connect to Jupyter Notebook on a Remote Server.
A modern web browser that you will use to access Jupyter Notebook.
A fundamental understanding of machine learning and deep learning models. Learn more in An Introduction to Machine Learning.
Jupyter Notebook provides an interactive computational environment, so it is often used to run deep learning models rather than running Python from a command-line terminal. With Jupyter Notebook, commands and outputs appear in one notebook, enabling you to document your thoughts while developing the analysis process.
To follow this tutorial in your Jupyter Notebook, you will need to open a new Notebook and install the required dependencies, which you will do in this step.
Note: If you are following the tutorial on a remote server, you can use port forwarding to access your Jupyter Notebook in the browser of your local machine.
Open a terminal and enter the following command:
- ssh -L 8888:localhost:8888 your_non_root_user@your_server_ip
Upon connecting to the server, navigate to the link provided by the output to access your Jupyter Notebook. Keep this terminal open throughout the remainder of this tutorial.
You set up a Jupyter Notebook environment on your server in the prerequisites. Once you have logged in to your server, activate the virtual environment:
- source ~/environments/my_env/bin/activate
Then start the Jupyter Notebook application:
- jupyter notebook
After running and connecting to it, you will access a user interface in your browser. From the New dropdown menu, select the Python3(ipykernel) option, which will open a new tab with an untitled Python notebook. Name the file neural_network.ipynb since you will run your code in this file.
Then, in the first cell of your browser's Jupyter Notebook, use pip to install the necessary dependencies for processing your data:
!pip install numpy
!pip install tensorflow
The numpy dependency is used to manipulate arrays in linear algebra. In this tutorial, you will use it to manipulate the IMDb dataset in its array form by calling these functions (a short illustrative sketch of each follows this list):
The concatenate function to join the sequence of test data arrays to the training data arrays.
The unique function to find the unique elements in the dataset array.
The zeros function to return a new array filled with zeros when you vectorize the dataset.
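Before applying these calls to the IMDb data, it can help to see them on toy arrays. The following is a minimal, optional sketch rather than one of the tutorial's notebook cells, and the array values are only illustrative:
import numpy as np

a = np.array([1, 2, 3])
b = np.array([3, 4, 5])

print(np.concatenate((a, b), axis=0))       # [1 2 3 3 4 5]: joins the arrays end to end
print(np.unique(np.concatenate((a, b))))    # [1 2 3 4 5]: the sorted unique elements
print(np.zeros((2, 3)))                     # a 2x3 array filled with zeros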
The tensorflow dependency allows you to train and deploy your deep learning model in Python. Installing tensorflow also installs keras, which runs on top of TensorFlow and introduces a level of abstraction between TensorFlow and the user to enable the fast-paced development of deep learning models. This tutorial uses tensorflow and keras for the entire sentiment analysis training and deployment process.
After adding the two commands to your Jupyter Notebook, press the Run button to run them.
Your Jupyter Notebook will provide a running output to indicate that each dependency is being downloaded. A new input cell will be available below this output, where you will run the next lines of code.
When the dependencies have finished downloading, you will import them. Add these lines to the next cell and then press Run:
import numpy as np
from keras.utils import to_categorical
from keras import models
from keras import layers
Note: You might receive a warning about TensorFlow and TensorRT libraries when running these commands. Generally, TensorFlow works with CPUs, GPUs, and TPUs. The warning states that the installed version of TensorFlow can use AVX and AVX2 operations, which can speed up processing. This warning is not an error but a note that TensorFlow will take advantage of your CPU for additional speed.
The keras installation includes the IMDb dataset built in. The dataset has a 50/50 train/test split. For this tutorial, you will set up an 80/20 split to prevent overtraining the neural network. As such, you will merge the data into data and targets after downloading so you can do the 80/20 split later in the tutorial. Add these lines to a new cell and press Run:
from keras.datasets import imdb
(training_data, training_targets), (testing_data, testing_targets) = imdb.load_data(num_words=10000)
data = np.concatenate((training_data, testing_data), axis=0)
targets = np.concatenate((training_targets, testing_targets), axis=0)
This cell imports the IMDb dataset and joins the training data with the test data. By default, the dataset is split into training_data, training_targets, testing_data, and testing_targets.
Your Jupyter Notebook will feature an activity log and will take a few moments to download the dataset:
Output
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz
17464789/17464789 [==============================] - 0s 0us/step
In this step, you prepared your Jupyter Notebook environment so that you can investigate this dataset for patterns, assumptions, and test anomalies. Next, you will perform exploratory data analysis on the entire dataset.
Now you will assess the dataset to identify how to train your model with this data. Conducting an exploratory data analysis on your dataset will clarify the underlying structure of a dataset. This process may expose trends, patterns, and relationships that are not readily apparent. This information can help you detect mistakes, debunk assumptions, and understand the relationships between key variables. Such insights may eventually lead to the selection of an appropriate predictive model.
Your first task with this dataset is to retrieve the output types and the number of unique words. To get this information, run the following lines in a new cell:
print("The output categories are", np.unique(targets))
print("The number of unique words is", len(np.unique(np.hstack(data))))
This cell prints the number of unique sentiments in the dataset (positive [1] or negative [0]) and the number of unique words used across the reviews.
The following output will print:
Output
The output categories are [0 1]
The number of unique words is 9998
The first line of this output states that the reviews are labeled either positive (1) or negative (0). The second line states that there are 9998 unique words in the dataset.
Next, you will ascertain the average length of words for the movie reviews and the standard deviation of words. To do so, run these lines of code in a new cell:
length = [len(i) for i in data]
print("The Average Review length is", np.mean(length))
print("The Standard Deviation is", round(np.std(length)))
This cell will print the average review length and standard deviation for the dataset:
Output
The Average Review length is 234.75892
The Standard Deviation is 173
This assessment indicates that the average review length is about 235 words, with a standard deviation of about 173 words.
Next, you will print an element of the dataset (the first index) by running these lines in a new cell:
print("Label:", targets[0])
print(data[0])
A movie review-label pair for the first element in the dataset will print:
Output
Label: 1
[1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 2, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 2, 8, 4, 107, 117, 5952, 15, 256, 4, 2, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 2, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 7486, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 5535, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 5345, 19, 178, 32]
This output provides the dataset's first review, marked as positive (1), and the full text as an integer index. By default, text reviews are given in their numerical encoded form as a list of integer-based word indices. The words in the reviews are indexed based on how frequently they appear across the entire dataset, with the lowest indices reserved for special markers such as padding, the start of a review, and out-of-vocabulary words. That offset is why the decoding step in the next cell subtracts 3 from each index.
You will next retrieve the dictionary, mapping word indices back into the original words so that you can read the text review. Run these lines in a new cell:
index = imdb.get_word_index()
reverse_index = dict([(value, key) for (key, value) in index.items()])
decoded = " ".join( [reverse_index.get(i - 3, "#") for i in data[0]] )
print(decoded)
This code block will decode the numerical form into readable text. With the get_word_index() function, you retrieve a dict mapping words to their index in the IMDb dataset. The reverse_index variable then holds a dict that maps indices to words after reversing the word index. The dict() function creates a dictionary that stores data values in key-value pairs. The last two lines of code decode and print the first sequence in the dataset.
Using the get_word_index() function, you will receive the following output:
Output
# this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert # is an amazing actor and now the same being director # father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for # and would recommend it to everyone to watch and the fly fishing was amazing really cried at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also # to the two little boy's that played the # of norman and paul they were just brilliant children are often left out of the # list i think because the stars that play them all grown up are such a big profile for the whole film but these children are amazing and should be praised for what they have done don't you think the whole story was so lovely because it was true and was someone's life after all that was shared with us all
This decoding step turns the numerical data for this review into readable words and replaces every index it cannot map back to a word, including the reserved markers, with a #.
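If you want to see where those reserved indices come from, you can inspect the word index you retrieved above. This is an optional check that reuses the index and data variables from the earlier cells:
print(index["the"])   # "the" is the most frequent word in the corpus, so it maps to 1
print(data[0][:5])    # the encoded review starts with 1, the start-of-review marker
# Indices 0, 1, and 2 are reserved for padding, the start marker, and
# out-of-vocabulary words, which is why the decoding step subtracts 3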
In this step, you have assessed the dataset, reviewing how each review is prepared. With this information, you will now prepare the data to be trained.
In this step, you will prepare the dataset for training. The quality of a deep learning model depends heavily on the data it is trained on.
When preparing this data, you will format it in a precise way to yield valuable insights that lead to more accurate model outcomes. Some techniques for data preparation include feature selection (selecting the features relevant to the model), feature engineering (converting variables in your dataset into useful features using encoding methods), and splitting your dataset into train and test sets.
In this tutorial, you will split the data into test and train sets and perform feature engineering by vectorizing the data.
In the next cell, run the following lines of code to vectorize every review in the dataset:
def vectorize(sequences, dimension = 10000):
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1
    return results
data = vectorize(data)
targets = np.array(targets).astype("float32")
First, you vectorize every review so that each one becomes a vector of exactly 10,000 numbers. Each position in the vector corresponds to one of the 10,000 most frequent words (the num_words limit you set when loading the dataset) and is set to 1 if that word appears in the review, while every other position stays 0. The neural network requires every input to be the same size, and this encoding guarantees that.
The vectorize() function takes two parameters: an array and a preset dimension of 10000. This function calls the zeros() function from numpy, which returns a new array filled with zeros, and then sets the positions that correspond to the words present in each review to 1. With the last two lines of code, you call the defined function on the dataset and then convert the target column of your dataset to a 32-bit float. A 32-bit float is a floating-point number with about seven digits of precision; converting the targets to this type puts them in the numeric format the training step expects.
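As an optional sanity check, you can confirm what the vectorization produced. This short sketch reuses the data variable from the cell above and shows that each review is now a 10,000-dimensional vector of zeros and ones:
print(data.shape)          # (50000, 10000): one row per review
print(data[0][:10])        # the first few positions of the first vectorized review
print(int(data[0].sum()))  # the number of distinct word indices present in that review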
As the final step in preparing your data, you will split your data into training and testing sets. You will put 40,000 reviews in the training set and 10,000 in the testing set, providing the 80/20 split that was described earlier.
Split your dataset by running these commands in a new cell:
test_x = data[:10000]
test_y = targets[:10000]
train_x = data[10000:]
train_y = targets[10000:]
The dataset has been split into test and train sets in a 1:4 ratio, with the targets in the train and test sets saved as train_y and test_y and the reviews in the train and test sets saved as train_x and test_x, respectively. In addition to letting you test and evaluate your model with new data, splitting the dataset helps your model avoid overfitting, which is when an algorithm models the training data too well.
In this step, you prepared the data and separated the dataset into train and test sets. You transformed the raw data into features that can be used for a deep learning model. With your data prepared for training, you will now build and train the neural network that your deep learning model will use.
You can now build your neural network.
You will start by defining the type of model you want to build. There are two types of models available in keras: the Sequential API and the Functional API. In this tutorial, you will use the Sequential API because it allows you to create models layer by layer.
Note: For more complex deep learning models, you should use the Functional API because the Sequential API does not allow you to create models that share layers or have multiple inputs or outputs. However, for this tutorial, the Sequential API will suffice.
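For comparison, here is a minimal sketch of how a similar model could be expressed with the Functional API. You will not use this model in the tutorial, and the layer sizes are only illustrative:
from keras import layers, models

inputs = layers.Input(shape=(10000,))                # explicit input tensor
x = layers.Dense(50, activation="relu")(inputs)      # layers are called on tensors
x = layers.Dropout(0.3)(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
functional_model = models.Model(inputs=inputs, outputs=outputs)
Because each layer is called on a tensor, the Functional API lets you branch, merge, and reuse layers, which the Sequential API cannot do.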
Set your model as the sequential model by running this command in a new cell:
model = models.Sequential()
Note: You may receive another TensorFlow error at this point, stating to rebuild TensorFlow with the appropriate compiler tags. This error is related to the previous error and results because TensorFlow 2.x packages support both CPU and GPU, so TensorFlow is looking for the GPU drivers. You can safely ignore this warning, as it will not impact the result of the tutorial.
Because layers form the foundation of deep learning models, you will next add the input, hidden, and output layers. You will use Dense layers for each of them and add Dropout layers in between to prevent overfitting.
Run these lines in a new cell to add the layers:
# Input - Layer
model.add(layers.Dense(50, activation = "relu", input_shape=(10000, )))
# Hidden - Layers
model.add(layers.Dropout(0.3, noise_shape=None, seed=None))
model.add(layers.Dense(50, activation = "relu"))
model.add(layers.Dropout(0.2, noise_shape=None, seed=None))
model.add(layers.Dense(50, activation = "relu"))
# Output- Layer
model.add(layers.Dense(1, activation = "sigmoid"))
model.summary()
You will use the relu function within the hidden layers because it yields an acceptable result. relu stands for Rectified Linear Unit: the function returns 0 for any negative input and returns the input unchanged for any positive value. Because relu is applied to the first layer and to each hidden layer, the values passed from one layer to the next are never negative.
At the output layer, you will use the sigmoid function, which maps any value into the range between 0 and 1. Since a review is labeled either positive (1) or negative (0), the sigmoid output can be read as the model's confidence that the review is positive, and rounding it gives a prediction of 0 or 1.
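To see what these two activations do numerically, here is a small illustrative sketch using numpy. It is not part of the model; it only evaluates the functions on a few sample values:
import numpy as np

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])

relu = np.maximum(0, x)          # negative inputs become 0, positive inputs pass through
sigmoid = 1 / (1 + np.exp(-x))   # every input is squashed into the range (0, 1)

print(relu)      # [0.  0.  0.  0.5 2. ]
print(sigmoid)   # approximately [0.12 0.38 0.5 0.62 0.88]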
Lastly, you will let keras print a summary of the model you have just built.
You will receive a summary of the model you have just built:
Output
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense (Dense) (None, 50) 500050
dropout (Dropout) (None, 50) 0
dense_1 (Dense) (None, 50) 2550
dropout_1 (Dropout) (None, 50) 0
dense_2 (Dense) (None, 50) 2550
dense_3 (Dense) (None, 1) 51
=================================================================
Total params: 505,201
Trainable params: 505,201
Non-trainable params: 0
_________________________________________________________________
Next, you will compile and configure the model for training. You will use the adam optimizer, which is an algorithm that changes the weights and biases during training, with binary_crossentropy as the loss and accuracy as the evaluation metric. The loss function computes the quantity the model should seek to minimize during training. You choose binary_crossentropy in this instance because the cross-entropy loss between true and predicted labels is an excellent measure for binary (0 or 1) classification applications.
To compile the model, run the following lines in the next cell:
model.compile(
optimizer = "adam",
loss = "binary_crossentropy",
metrics = ["accuracy"]
)
The compile() function configures the model for training. You implement the adam algorithm as your model's optimizer in this definition. This algorithm is a gradient descent method based on estimates of first-order and second-order moments. The metrics and loss parameters are closely related: the metric parameter defines how to judge your model's performance, while the loss parameter defines the quantity that the model seeks to minimize during training. The metric here is accuracy (the fraction of predictions the model gets correct), while the loss is binary_crossentropy (a measure of the difference between the true labels and the predictions, used when there are only two label classes, the positive 1 and the negative 0).
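If you want a feel for what binary cross-entropy measures, the following illustrative sketch computes it by hand for a few hypothetical predictions. Confident correct predictions contribute a small loss, and confident wrong predictions contribute a large one:
import numpy as np

y_true = np.array([1.0, 0.0, 1.0])     # hypothetical true labels
y_pred = np.array([0.9, 0.1, 0.2])     # hypothetical sigmoid outputs from a model

# binary cross-entropy: -[y*log(p) + (1-y)*log(1-p)], averaged over the samples
loss = -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
print(round(loss, 2))                  # about 0.61; the low 0.2 for a true 1 dominates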
Now you can train your model. To do this, you will use a batch_size of 32 and just two epochs. The batch size is the number of samples that are propagated through the neural network before the weights are updated, and an epoch is one pass over the entire training data. A larger batch size generally implies faster training but sometimes converges more slowly. Conversely, a smaller batch size is slower in training but can converge faster.
You will now start training your model to get all the parameters to the correct values for mapping your inputs to your outputs. Run these lines in the next cell:
results = model.fit(
train_x, train_y,
epochs= 2,
batch_size = 32,
validation_data = (test_x, test_y)
)
You will train your model using the .fit() function. This function trains the deep learning model for a fixed number of iterations on a dataset. The function takes two required parameters:
train_x refers to the input data.
train_y refers to the target data for the training set.
It can also take optional parameters, including the following:
epochs (the number of epochs to train the model, where an epoch is an iteration over the entire data provided).
batch_size (the number of samples per gradient update).
validation_data (the data on which the model will evaluate the loss at the end of each epoch).
This code trains the model using two epochs and a batch size of 32, which means that the entire dataset will be passed through the neural network twice with 32 training examples used in each iteration. The validation data is given as test_x and test_y.
Note: Training a sentiment analysis model is RAM-intensive. If you run this tutorial on an 8GB RAM server, you may receive the following warning: Allocation of x exceeds 10% of free system memory. You can ignore this warning when it occurs and continue with the tutorial; it only means that the training takes up a sizeable amount of the free system memory and has no effect on the rest of the tutorial.
In this step, you built your deep learning model and trained it on the dataset you prepared. Next, you will evaluate the model’s performance against a different dataset using the validation data generated in this step.
You will evaluate the model in this step. Model evaluation is integral to the machine learning improvement and development process. This evaluation helps to find the best model that represents your data and how well the chosen model works.
There are four primary model evaluation metrics for a machine learning classification model: accuracy, precision, recall, and F1 score.
Accuracy is a commonly used performance metric because it evaluates the fraction of predictions your model got right. Accuracy is determined by dividing the number of correct predictions by the total number of predictions. You will use accuracy to assess your model in this tutorial.
Precision, in the context of this model, refers to the number of positive movie reviews that were correctly predicted divided by the total number of movie reviews predicted as positive. Recall is the number of positive movie reviews that were correctly predicted divided by the total number of truly positive movie reviews in the dataset. Precision answers the question, Of all movie reviews that your model marked as positive, how many were actually positive? In contrast, recall answers the question, Of all movie reviews that are truly positive, how many did your model mark as positive? The F1 score is the harmonic mean of precision and recall. As such, it accounts for both kinds of misclassification in a single number.
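Although this tutorial only reports accuracy, you could compute precision, recall, and the F1 score from the trained model's predictions as well. The following is a minimal sketch that reuses the model, test_x, and test_y variables from the earlier cells and turns the sigmoid outputs into hard 0/1 predictions at a 0.5 threshold:
import numpy as np

# Convert the model's sigmoid outputs into hard 0/1 predictions
predictions = (model.predict(test_x) > 0.5).astype("float32").flatten()

true_positives = np.sum((predictions == 1) & (test_y == 1))
false_positives = np.sum((predictions == 1) & (test_y == 0))
false_negatives = np.sum((predictions == 0) & (test_y == 1))

precision = true_positives / (true_positives + false_positives)
recall = true_positives / (true_positives + false_negatives)
f1 = 2 * precision * recall / (precision + recall)

print("Precision:", round(float(precision), 3))
print("Recall:", round(float(recall), 3))
print("F1 score:", round(float(f1), 3))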
In this tutorial, you will evaluate your model performance using its accuracy. Run these lines of code in the next cell:
scores = model.evaluate(test_x, test_y, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))
This code stores the model's accuracy score in a variable called scores and prints it to the screen. The .evaluate() function takes three parameters:
test_x: the feature data for the test dataset.
test_y: the target data for the test dataset.
verbose: the mode of verbosity.
The following output will print with the accuracy rate:
Output
Accuracy: 86.59%
The accuracy of this trained model is 86.59%.
This accuracy score indicates that this model correctly predicts if a review is positive or negative about nine out of ten times. You can continue to work on your code to try to make your classifier perform with better accuracy. To make your model perform better and improve accuracy, you can increase the number of epochs or the batch size for your model.
Deep learning models (and machine learning models in general) are only as good as the data you feed them, so increasing the accuracy of your model is often achieved by adding more data. However, the data used for this model is built in and cannot be modified. In this case, you can improve the model's accuracy by adding more layers to your model or increasing the number of epochs (the number of times you pass the entire dataset through the neural network) in Step 4.
To increase the number of epochs, change the value of epochs in the model.fit() cell from 2 to 3 (or another number), then rerun that cell and the cells that follow:
results = model.fit(
train_x, train_y,
epochs= 3,
batch_size = 32,
validation_data = (test_x, test_y)
)
The number of epochs has been increased, which means that the training data will pass through the neural network three times in total, and the model will have an additional opportunity to learn from the data. When you rerun the model.evaluate() function, you will receive a new output with an updated accuracy rate.
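If you instead want to experiment with a deeper network, you could rebuild the model with an extra hidden layer before compiling, fitting, and evaluating it again. This is only an illustrative sketch; the layer size of 50 mirrors the existing architecture, and more layers do not guarantee better accuracy:
model = models.Sequential()
model.add(layers.Dense(50, activation="relu", input_shape=(10000,)))
model.add(layers.Dropout(0.3))
model.add(layers.Dense(50, activation="relu"))
model.add(layers.Dropout(0.2))
model.add(layers.Dense(50, activation="relu"))
model.add(layers.Dense(50, activation="relu"))   # additional hidden layer
model.add(layers.Dense(1, activation="sigmoid"))
After rebuilding the model this way, rerun the compile, fit, and evaluate cells to see how the change affects the accuracy score.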
In this step, you evaluated the model you built by computing its accuracy. After the initial computation, you increased the number of epochs to improve the model and reevaluate the accuracy score.
In this tutorial, you trained a neural network to categorize the sentiment of movie reviews as positive or negative using keras. You used the IMDb sentiment classification dataset collected by Stanford University researchers, which ships with keras as one of its built-in datasets for binary sentiment classification. The dataset consists of 25,000 highly polarized movie reviews for training and another 25,000 for testing. You worked with this dataset to develop a large neural network model for sentiment analysis.
Now that you have built and trained a neural network, you can try this implementation with your own data or test it on other popular datasets. You could experiment with the other keras datasets or try different algorithms.
To build on your keras and TensorFlow experience, you can follow our tutorial on How To Build a Deep Learning Model to Predict Employee Retention Using Keras and TensorFlow.