In data analysis, you often need to handle missing, negative, or inaccurate values in a dataset. You can address these problems by replacing the values with 0, NA, or the mean.
In this article, you will explore how to use the replace() and is.na() functions in R.
To complete this tutorial, you will need a working installation of R.
Replacing a Value in a Vector with replace()
This section will show how to replace a value in a vector.
The replace() function takes the target vector, an index vector, and the replacement values:
replace(target, index, replacement)
First, create a vector:
df <- c('apple', 'orange', 'grape', 'banana')
df
This will create a vector with apple, orange, grape, and banana:
Output
[1] "apple" "orange" "grape" "banana"
Now, replace the second item in the vector:
dy <- replace(df, 2, 'blueberry')
dy
This will replace orange with blueberry:
Output
[1] "apple" "blueberry" "grape" "banana"
Next, replace the fourth item in the vector:
dx <- replace(dy, 4, 'cranberry')
dx
This will replace banana with cranberry:
Output
[1] "apple" "blueberry" "grape" "cranberry"
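The calls above each change a single position. As a minimal sketch (the vector name fruits and the replacement values are made up for illustration), the index argument can also be a vector of positions or a logical condition, and replace() returns a modified copy rather than changing the original vector:

```r
# Hypothetical vector for illustration
fruits <- c("apple", "orange", "grape", "banana")

# Replace positions 2 and 4 in one call
by_position <- replace(fruits, c(2, 4), c("blueberry", "cranberry"))

# Replace every element that matches a condition
by_condition <- replace(fruits, fruits == "grape", "melon")

print(by_position)
print(by_condition)
print(fruits)  # the original vector is unchanged
```

Because the original is left untouched, the result has to be assigned to a name, as the earlier steps do with dy and dx.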
Replacing NA Values with 0 in R
Consider a scenario where you have a data frame containing measurements:
   Ozone Solar.R Wind Temp Month Day
1     41     190  7.4   67     5   1
2     36     118  8.0   72     5   2
3     12     149 12.6   74     5   3
4     18     313 11.5   62     5   4
5     NA      NA 14.3   56     5   5
6     28      NA 14.9   66     5   6
7     23     299  8.6   65     5   7
8     19      99 13.8   59     5   8
9      8      19 20.1   61     5   9
10    NA     194  8.6   69     5  10
11     7      NA  6.9   74     5  11
12    16     256  9.7   69     5  12
Here is the data in CSV format:
Ozone,Solar.R,Wind,Temp,Month,Day
41,190,7.4,67,5,1
36,118,8.0,72,5,2
12,149,12.6,74,5,3
18,313,11.5,62,5,4
NA,NA,14.3,56,5,5
28,NA,14.9,66,5,6
23,299,8.6,65,5,7
19,99,13.8,59,5,8
8,19,20.1,61,5,9
NA,194,8.6,69,5,10
7,NA,6.9,74,5,11
16,256,9.7,69,5,12
This contains the string NA for “Not Available” wherever the data is missing.
You can replace the NA values with 0.
First, define the data frame:
df <- read.csv('air_quality.csv')
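If the CSV file is not at hand, the same step can be tried without it. This is a minimal sketch, assuming three rows of the sample data passed through the text argument of read.csv(); the names csv_text and sample_df are made up for illustration. It also shows that read.csv() turns the string NA into a missing value by default:

```r
# Three rows of the sample data, supplied as text instead of a file
csv_text <- "Ozone,Solar.R,Wind,Temp,Month,Day
41,190,7.4,67,5,1
NA,NA,14.3,56,5,5
28,NA,14.9,66,5,6"

# read.csv() treats the string NA as a missing value by default
sample_df <- read.csv(text = csv_text)

# TRUE marks the rows where Ozone is missing
print(is.na(sample_df$Ozone))

# Total number of missing values in the data frame
print(sum(is.na(sample_df)))
```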
Use is.na() to check whether a value is NA. Then, replace the NA values with 0:
df[is.na(df)] <- 0
df
The data frame is now:
Output
   Ozone Solar.R Wind Temp Month Day
1     41     190  7.4   67     5   1
2     36     118  8.0   72     5   2
3     12     149 12.6   74     5   3
4     18     313 11.5   62     5   4
5      0       0 14.3   56     5   5
6     28       0 14.9   66     5   6
7     23     299  8.6   65     5   7
8     19      99 13.8   59     5   8
9      8      19 20.1   61     5   9
10     0     194  8.6   69     5  10
11     7       0  6.9   74     5  11
12    16     256  9.7   69     5  12
All occurrences of NA in the data frame have been replaced with 0.
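Before overwriting every NA, it can help to see how many there are and to limit the replacement to one column. The following is a minimal sketch, assuming a small inline data frame named air (a hypothetical name) that holds three columns from rows 4 to 7 of the sample data:

```r
# Rows 4 to 7 of the sample data, typed inline so no CSV file is needed
air <- data.frame(
  Ozone   = c(18, NA, 28, 23),
  Solar.R = c(313, NA, NA, 299),
  Wind    = c(11.5, 14.3, 14.9, 8.6)
)

# Count the missing values in each column
na_counts <- colSums(is.na(air))
print(na_counts)

# Replace NA with 0 in the Ozone column only; Solar.R keeps its NA values
air$Ozone <- replace(air$Ozone, is.na(air$Ozone), 0)
print(air)
```

Restricting the change to one column leaves the missing values elsewhere visible, which is useful when 0 is a sensible stand-in for some measurements but not for others.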
Replacing NA Values with the Mean of the Values in R
In many analyses, replacing NA values with the mean of the column gives more accurate results than replacing them with 0 or dropping the affected rows. The mean() function calculates the mean value.
With this approach, each NA value is replaced by the mean of the remaining values in the same column, so no rows are lost.
Consider the following input data set with NA values:
   Ozone Solar.R Wind Temp Month Day
1     41     190  7.4   67     5   1
2     36     118  8.0   72     5   2
3     12     149 12.6   74     5   3
4     18     313 11.5   62     5   4
5     NA      NA 14.3   56     5   5
6     28      NA 14.9   66     5   6
7     23     299  8.6   65     5   7
8     19      99 13.8   59     5   8
9      8      19 20.1   61     5   9
10    NA     194  8.6   69     5  10
11     7      NA  6.9   74     5  11
12    16     256  9.7   69     5  12
df <- read.csv('air_quality.csv')
Use is.na() and mean() to replace the NA values in the Ozone column:
df$Ozone[is.na(df$Ozone)] <- mean(df$Ozone, na.rm = TRUE)
First, this code finds all the occurrences of NA in the Ozone column. Next, it calculates the mean of all the values in the Ozone column, using the na.rm argument to exclude the NA values. Then each instance of NA is replaced with the calculated mean (20.8).
Then use round() to round the values to whole numbers:
df$Ozone <- round(df$Ozone, digits = 0)
The data frame is now:
Output
   Ozone Solar.R Wind Temp Month Day
1     41     190  7.4   67     5   1
2     36     118  8.0   72     5   2
3     12     149 12.6   74     5   3
4     18     313 11.5   62     5   4
5     21      NA 14.3   56     5   5
6     28      NA 14.9   66     5   6
7     23     299  8.6   65     5   7
8     19      99 13.8   59     5   8
9      8      19 20.1   61     5   9
10    21     194  8.6   69     5  10
11     7      NA  6.9   74     5  11
12    16     256  9.7   69     5  12
The NA values in the Ozone column are now replaced by the rounded mean of the values in the Ozone column (21). The NA values in the Solar.R column are unchanged.
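The same idea extends to several columns at once. This is a minimal sketch, assuming every column is numeric; the helper name fill_with_mean and the inline data frame air (three columns from rows 4 to 7 of the sample data) are made up for illustration:

```r
# Rows 4 to 7 of the sample data, typed inline so no CSV file is needed
air <- data.frame(
  Ozone   = c(18, NA, 28, 23),
  Solar.R = c(313, NA, NA, 299),
  Wind    = c(11.5, 14.3, 14.9, 8.6)
)

# Hypothetical helper: fill the NA values of a numeric vector with its mean
fill_with_mean <- function(x) {
  replace(x, is.na(x), mean(x, na.rm = TRUE))
}

# Apply the helper to every column; assigning into air[] keeps the data frame
air[] <- lapply(air, fill_with_mean)
print(air)
```

Each column is filled with its own mean, so the missing Ozone value becomes 23 and the missing Solar.R values become 306, while Wind is unchanged.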
Replacing Negative Values with 0 or NA in R
Sometimes you will want to replace the negative values in a data frame with 0 or NA. This is useful when a negative number cannot be a valid measurement, such as a count. Left in place, such values skew sums and means and can mislead the analysis.
Consider the following input data set with negative values:
   count entry1 entry2 entry3
1      1    345   -234    345
2      2     65    654    867
3      3     23    345   3456
4      4     87    867      9
5      5   2345     34    867
6      6    876     98     76
7      7     35   -456    123
8      8     87     98    345
9      9   -765     67    765
10    10   4567    -87    234
Here is the data in CSV format:
count,entry1,entry2,entry3
1,345,-234,345
2,65,654,867
3,23,345,3456
4,87,867,9
5,2345,34,867
6,876,98,76
7,35,-456,123
8,87,98,345
9,-765,67,765
10,4567,-87,234
Read the CSV file:
df <- read.csv('negative_values.csv')
Replacing Negative Values with 0
Use replace() to change the negative values in the entry2 column to 0:
data_zero <- df
data_zero$entry2 <- replace(df$entry2, df$entry2 < 0, 0)
data_zero
The data frame is now:
Output
   count entry1 entry2 entry3
1      1    345      0    345
2      2     65    654    867
3      3     23    345   3456
4      4     87    867      9
5      5   2345     34    867
6      6    876     98     76
7      7     35      0    123
8      8     87     98    345
9      9   -765     67    765
10    10   4567      0    234
The negative values in the entry2 column have been replaced with 0. The -765 in the entry1 column is unchanged because only entry2 was modified.
Replacing Negative Values with NA
Use replace() to change the negative values in the entry2 column to NA:
data_na <- df
data_na$entry2 <- replace(df$entry2, df$entry2 < 0, NA)
data_na
The data frame is now:
Output
   count entry1 entry2 entry3
1      1    345     NA    345
2      2     65    654    867
3      3     23    345   3456
4      4     87    867      9
5      5   2345     34    867
6      6    876     98     76
7      7     35     NA    123
8      8     87     98    345
9      9   -765     67    765
10    10   4567     NA    234
The negative values in the entry2 column have been replaced with NA.
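The steps above change one column at a time. To treat every column in one pass, a comparison on the whole data frame can serve as the index. This is a minimal sketch, assuming all columns are numeric; the data frame name entries is made up, and it holds three columns from rows 1, 7, 9, and 10 of the sample data:

```r
# Rows 1, 7, 9, and 10 of the sample data, typed inline
entries <- data.frame(
  count  = c(1, 7, 9, 10),
  entry1 = c(345, 35, -765, 4567),
  entry2 = c(-234, -456, 67, -87)
)

# Logical matrix marking every negative cell in the data frame
is_negative <- entries < 0

# Assign NA to all marked cells at once
entries[is_negative] <- NA
print(entries)
```

Unlike the single-column calls, this also catches the -765 in entry1. Assigning 0 instead of NA works the same way.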
Replacing values in a data frame is a convenient way to prepare data for analysis in R. Using replace() and is.na(), you can replace NA values and negative values with 0, NA, or the mean to clean up large datasets before analyzing them.
Continue your learning with How To Use sub() and gsub() in R.