Hello, readers! In this article, we will focus on outlier analysis in R programming in detail.
So, let us begin!!
Before diving into the concept of outliers, let us look at the pre-processing of data values.
In data science and machine learning, pre-processing is a crucial step. By pre-processing, we mean removing errors and noise from the data prior to modeling.
In our last post, we covered missing value analysis in R programming.
Today, we focus on a more advanced pre-processing step: outlier detection and removal in R.
Outliers, as the name suggests, are the data points that lie far away from the other points of the dataset. That is, they are data values that sit well apart from the rest and hence distort the overall distribution of the dataset.
Such values are usually treated as abnormal observations in the data.
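To make the idea concrete, here is a minimal sketch on made-up numbers (the vectors typical and with_outlier are invented for illustration and are not part of the bike rental dataset). A single extreme value pulls the mean far away from the bulk of the data, while the median stays where it was:

```r
# Toy readings, invented for illustration only
typical <- c(10, 12, 11, 13, 12)
with_outlier <- c(typical, 100)  # append one extreme value

mean_typical <- mean(typical)            # 11.6
mean_with_outlier <- mean(with_outlier)  # about 26.33

median_typical <- median(typical)            # 12
median_with_outlier <- median(with_outlier)  # 12

print(c(mean_typical, mean_with_outlier))
print(c(median_typical, median_with_outlier))
```

Statistics and models that depend on the mean are therefore sensitive to outliers, which is why we look for them before modeling.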
Effect of Outliers on the model -
Having understood the effect of outliers, it is now time to work on the implementation.
First, we need to detect the presence of outliers in the dataset.
So, let us begin. We use the Bike Rental Count Prediction dataset. You can find the dataset here!
First, we load the dataset into the R environment using the read.csv() function.
Prior to outlier detection, we perform missing value analysis to check for any missing (NA) values. For this, we use sum(is.na(data)), which counts the NA values across the whole data frame.
# Remove all existing objects from the workspace
rm(list = ls())
# Set the working directory (adjust the path to your own machine)
setwd("D:/Ediwsor_Project - Bike_Rental_Count/")
getwd()
# Load the dataset
bike_data = read.csv("day.csv", header = TRUE)
### Missing Value Analysis ###
# Total number of NA values in the data frame
sum(is.na(bike_data))
# Per-column counts of missing (TRUE) and present (FALSE) values
summary(is.na(bike_data))
# From the above result, it is clear that the dataset contains no missing values.
The dataset contains no missing values.
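Since the bike rental data has nothing missing, the check above always returns zero. The following sketch runs the same calls on a small made-up data frame (toy is a hypothetical name, and its values are invented) so you can see what they report when NA values are present:

```r
# Toy data frame with a few NA values, invented for illustration only
toy <- data.frame(
  hum = c(0.61, NA, 0.59, 0.70),
  windspeed = c(0.16, 0.25, NA, NA)
)

# Total number of NA values across the whole data frame
total_missing <- sum(is.na(toy))

# Number of NA values in each column
missing_per_column <- colSums(is.na(toy))

print(total_missing)
print(missing_per_column)
```

The per-column view is usually the more useful one, because it tells you which variables need attention.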
With that confirmed, it is time to detect the presence of outliers in the dataset. To do so, we store the names of the numeric columns in a separate vector using the c() function.
Next, we use the boxplot() function to detect outliers in the numeric variables.
BoxPlot:
From the visuals, it is clear that the variables ‘hum’ and ‘windspeed’ contain outliers in their data values.
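A boxplot draws a value as a separate point when it lies more than 1.5 times the interquartile range beyond the edges of the box. The sketch below reproduces that rule by hand on a made-up vector (x and its values are invented for illustration) and compares the result with boxplot.stats(), which reports the same points in its out component:

```r
# Toy values, invented for illustration only; 2 lies far below the rest
x <- c(45, 50, 52, 55, 58, 60, 62, 65, 70, 2)

# The box is drawn from Tukey's hinges, which fivenum() returns
# as elements 2 (lower hinge) and 4 (upper hinge)
hinges <- fivenum(x)[c(2, 4)]
iqr <- diff(hinges)

# Fence positions with the default multiplier of 1.5
lower_fence <- hinges[1] - 1.5 * iqr
upper_fence <- hinges[2] + 1.5 * iqr

# Values outside the fences are the ones drawn as separate points
manual_out <- x[x < lower_fence | x > upper_fence]

# boxplot.stats() applies the same rule and returns those values in $out
stats_out <- boxplot.stats(x)$out

print(c(lower_fence, upper_fence))
print(manual_out)
print(stats_out)
```

Checking the numbers this way is handy when a plot is crowded or when you want to use the flagged values in later code.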
Once the outliers are detected, we replace the values flagged by the boxplot rule with NA (missing) values, so that they can be handled like any other missing data, as shown below.
############################## Outlier Analysis -- DETECTION ###########################
# 1. Outlier analysis applies to continuous/numeric variables. Thus, we store the names of the numeric and categorical independent variables in separate vectors.
col = c('temp','cnt','hum','windspeed')
categorical_col = c("season","yr","mnth","holiday","weekday","workingday","weathersit")
# 2. Use a boxplot to detect the presence of outliers in the numeric/continuous data columns.
boxplot(bike_data[,c('temp','atemp','hum','windspeed')])
# From the above visualization, it is clear that the variables 'hum' and 'windspeed' contain outliers.
# OUTLIER ANALYSIS -- Removal of Outliers
# 1. From the boxplot, we have identified the presence of outliers. Values lying more than 1.5 times the interquartile range above the upper quartile or below the lower quartile are treated as outliers.
# 2. Now, we replace the outlier values with NA.
for (x in c('hum','windspeed'))
{
  # boxplot.stats()$out returns the values flagged by the boxplot rule
  value = boxplot.stats(bike_data[,x])$out
  bike_data[,x][bike_data[,x] %in% value] = NA
}
# Check whether the outliers in the above columns have been replaced by NA
sum(is.na(bike_data$hum))
sum(is.na(bike_data$windspeed))
as.data.frame(colSums(is.na(bike_data)))
Now, we check whether the outlier values have been converted to missing values properly, using sum(is.na()) on each column and colSums(is.na()) on the whole data frame.
Output:
> sum(is.na(bike_data$hum))
[1] 2
> sum(is.na(bike_data$windspeed))
[1] 13
> as.data.frame(colSums(is.na(bike_data)))
           colSums(is.na(bike_data))
instant                            0
dteday                             0
season                             0
yr                                 0
mnth                               0
holiday                            0
weekday                            0
workingday                         0
weathersit                         0
temp                               0
atemp                              0
hum                                2
windspeed                         13
casual                             0
registered                         0
cnt                                0
As a result, we have converted the 2 outlier points from the ‘hum’ column and the 13 outlier points from the ‘windspeed’ column into missing (NA) values.
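If you need the same replacement in several places, the loop can be wrapped in a small function. This is only a sketch: outliers_to_na is a hypothetical helper name, not something from base R or from the original code, and the toy data frame is invented for illustration:

```r
# Hypothetical helper: replace the values flagged by boxplot.stats() with NA
outliers_to_na <- function(x, coef = 1.5) {
  out <- boxplot.stats(x, coef = coef)$out
  x[x %in% out] <- NA
  x
}

# Toy data, invented for illustration only; the last row holds the extreme values
toy <- data.frame(
  hum = c(45, 50, 52, 55, 58, 60, 62, 65, 70, 2),
  windspeed = c(10, 11, 12, 12, 13, 13, 14, 15, 16, 40)
)

# Apply the helper to each selected column
target_cols <- c("hum", "windspeed")
toy[target_cols] <- lapply(toy[target_cols], outliers_to_na)

print(colSums(is.na(toy)))
```

The coef argument is passed straight to boxplot.stats(), so a larger value flags fewer points as outliers.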
Finally, we treat the missing values by dropping every row that contains an NA, using the drop_na() function from the ‘tidyr’ library.
# Remove the rows that contain NA values
library(tidyr)
bike_data = drop_na(bike_data)
as.data.frame(colSums(is.na(bike_data)))
As a result, all the rows containing the flagged outliers have now been removed!
Output:
> as.data.frame(colSums(is.na(bike_data)))
           colSums(is.na(bike_data))
instant                            0
dteday                             0
season                             0
yr                                 0
mnth                               0
holiday                            0
weekday                            0
workingday                         0
weathersit                         0
temp                               0
atemp                              0
hum                                0
windspeed                          0
casual                             0
registered                         0
cnt                                0
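Keep in mind that dropping NA values removes whole rows, not just the flagged cells, so it is worth checking how many observations are lost. The sketch below does this with base R's complete.cases(), which can stand in for drop_na() when tidyr is not installed (the toy data frame is invented for illustration):

```r
# Toy data frame with NA values in two different rows, invented for illustration only
toy <- data.frame(
  id = 1:5,
  hum = c(0.60, NA, 0.70, 0.65, 0.62),
  windspeed = c(0.20, 0.25, NA, 0.18, 0.22)
)

rows_before <- nrow(toy)

# Keep only the rows that have no NA in any column
toy_clean <- toy[complete.cases(toy), ]

rows_after <- nrow(toy_clean)
rows_dropped <- rows_before - rows_after

print(rows_dropped)
print(colSums(is.na(toy_clean)))
```

If too many rows would be lost this way, that is a sign to reconsider whether removal is the right treatment for the flagged values.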
With this, we have come to the end of this topic. Feel free to comment below in case you have any questions. For more such posts related to R programming, stay tuned!!
Till then, Happy Learning!! :)