In this tutorial, you’ll learn how to use the pandas DataFrame dropna() function.
NA values are “Not Available” values. In pandas, this applies to null values such as None, pandas.NaT, or numpy.nan. Using dropna() drops the rows or columns that contain these values, which is useful when you want to work with complete data only.
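To make the definition of NA concrete, here is a minimal sketch that checks a few scalar values with pd.isna(). The variable names are illustrative only and are not part of the tutorial's later examples.

```python
import numpy as np
import pandas as pd

# Values that pandas treats as missing (NA)
candidates = [None, np.nan, pd.NaT]
na_flags = [pd.isna(value) for value in candidates]
print(na_flags)

# Empty strings and zeros are ordinary values, not NA
ordinary = ["", 0]
ordinary_flags = [pd.isna(value) for value in ordinary]
print(ordinary_flags)
```

Because empty strings and zeros are not NA, dropna() leaves rows containing them in place.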
By default, this function returns a new DataFrame and the source DataFrame remains unchanged.
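The default behavior can be checked with a short sketch. It assumes a small hypothetical DataFrame named source, which is separate from the sample DataFrames used in the rest of the tutorial.

```python
import numpy as np
import pandas as pd

# Hypothetical one-column DataFrame with a single missing value
source = pd.DataFrame({"a": [1.0, np.nan, 3.0]})

# dropna() returns a new DataFrame by default
cleaned = source.dropna()

print(len(source))   # the source still has all of its rows
print(len(cleaned))  # the returned DataFrame has the NA row removed
```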
This tutorial was verified with Python 3.10.9, pandas 1.5.2, and NumPy 1.24.1.
dropna() takes the following parameters:
dropna(self, axis=0, how="any", thresh=None, subset=None, inplace=False)
axis: {0 (or 'index'), 1 (or 'columns')}, default 0
    If 0, drop rows with missing values.
    If 1, drop columns with missing values.
how: {'any', 'all'}, default 'any'
    If 'any', drop the row or column if any of the values is NA.
    If 'all', drop the row or column if all of the values are NA.
thresh: (optional) an int value that sets the minimum number of non-NA values a row or column needs in order to be kept.
subset: (optional) a label or sequence of labels along the other axis to check for NA values. When dropping rows, these are column labels; when dropping columns, these are index labels.
inplace: (optional) a bool value.
    If True, the source DataFrame is changed and None is returned.
Construct a sample DataFrame that contains valid and invalid values:
import pandas as pd
import numpy as np
d1 = {
'Name': ['Shark', 'Whale', 'Jellyfish', 'Starfish'],
'ID': [1, 2, 3, 4],
'Population': [100, 200, np.nan, pd.NaT],
'Regions': [1, None, pd.NaT, pd.NaT]
}
df1 = pd.DataFrame(d1)
print(df1)
This code will print out the DataFrame:
Output
        Name  ID Population Regions
0      Shark   1        100       1
1      Whale   2        200    None
2  Jellyfish   3        NaN     NaT
3   Starfish   4        NaT     NaT
Then add a second DataFrame with additional rows and columns with NA values:
d2 = {
'Name': ['Shark', 'Whale', 'Jellyfish', 'Starfish', pd.NaT],
'ID': [1, 2, 3, 4, pd.NaT],
'Population': [100, 200, np.nan, pd.NaT, pd.NaT],
'Regions': [1, None, pd.NaT, pd.NaT, pd.NaT],
'Endangered': [pd.NaT, pd.NaT, pd.NaT, pd.NaT, pd.NaT]
}
df2 = pd.DataFrame(d2)
print(df2)
This will output a new DataFrame:
Output
        Name   ID Population Regions Endangered
0      Shark    1        100       1        NaT
1      Whale    2        200    None        NaT
2  Jellyfish    3        NaN     NaT        NaT
3   Starfish    4        NaT     NaT        NaT
4        NaT  NaT        NaT     NaT        NaT
You will use the preceding DataFrames in the examples that follow.
Use dropna() to remove rows with any None, NaN, or NaT values:
dfresult = df1.dropna()
print(dfresult)
This will output:
Output
    Name  ID Population Regions
0  Shark   1        100       1
The result is a new DataFrame containing the single row that did not have any NA values.
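Before dropping rows, it can help to see how many NA values each row holds. The following is a minimal sketch on a small hypothetical DataFrame named df (a simplified stand-in for the sample data), using isna() to count missing values per row.

```python
import numpy as np
import pandas as pd

# Hypothetical, simplified stand-in for the sample data
df = pd.DataFrame({
    "Name": ["Shark", "Whale", "Jellyfish"],
    "Population": [100, 200, np.nan],
    "Regions": [1, None, None],
})

# Count the NA values in each row
na_per_row = df.isna().sum(axis=1)
print(na_per_row)

# Rows with a count of zero are the ones dropna() keeps by default
rows_to_keep = na_per_row == 0
dropped = df.dropna()
print(dropped)
```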
Use dropna() with axis=1 to remove columns with any None, NaN, or NaT values:
dfresult = df1.dropna(axis=1)
print(dfresult)
The columns with any None, NaN, or NaT values will be dropped:
Output
        Name  ID
0      Shark   1
1      Whale   2
2  Jellyfish   3
3   Starfish   4
The result is a new DataFrame with the two columns, Name and ID, that contained only non-NA values.
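The axis parameter also accepts the string labels listed earlier. As a minimal sketch on a small hypothetical DataFrame named df, axis="columns" should behave the same as axis=1:

```python
import numpy as np
import pandas as pd

# Hypothetical DataFrame with one column that contains an NA value
df = pd.DataFrame({
    "Name": ["Shark", "Whale"],
    "ID": [1, 2],
    "Population": [100, np.nan],
})

by_number = df.dropna(axis=1)
by_label = df.dropna(axis="columns")

print(by_label)
```

The string form can make the intent easier to read in longer scripts.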
Dropping Rows or Columns if all the Values are Null with how
Use the second DataFrame with how:
dfresult = df2.dropna(how='all')
print(dfresult)
The rows with all values equal to NA will be dropped:
Output
        Name  ID Population Regions Endangered
0      Shark   1        100       1        NaT
1      Whale   2        200    None        NaT
2  Jellyfish   3        NaN     NaT        NaT
3   Starfish   4        NaT     NaT        NaT
The fifth row was dropped.
Next, use how and specify the axis:
dfresult = df2.dropna(how='all', axis=1)
print(dfresult)
The columns with all values equal to NA will be dropped:
Output
        Name   ID Population Regions
0      Shark    1        100       1
1      Whale    2        200    None
2  Jellyfish    3        NaN     NaT
3   Starfish    4        NaT     NaT
4        NaT  NaT        NaT     NaT
The fifth column was dropped.
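One way to reason about how='all' is as a boolean mask over rows. The sketch below uses a small hypothetical DataFrame named df and compares dropna(how="all") with an explicit mask built from isna().all(axis=1); it illustrates the idea and is not a description of how pandas implements the method internally.

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 1 is partly NA, row 2 is entirely NA
df = pd.DataFrame({
    "Name": ["Shark", "Whale", None],
    "ID": [1, np.nan, np.nan],
})

# True for rows where every value is NA
all_na_rows = df.isna().all(axis=1)

masked = df[~all_na_rows]
dropped = df.dropna(how="all")

print(dropped)
```

The partly filled row survives because at least one of its values is not NA.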
Dropping Rows or Columns with thresh
Use the second DataFrame with thresh to drop rows that do not meet the threshold of at least 3 non-NA values:
dfresult = df2.dropna(thresh=3)
print(dfresult)
The rows that do not have at least 3 non-NA values will be dropped:
Output
    Name ID Population Regions Endangered
0  Shark  1        100       1        NaT
1  Whale  2        200    None        NaT
The third, fourth, and fifth rows were dropped.
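To see why a given row passes or fails the threshold, you can count the non-NA values per row. This is a minimal sketch on a small hypothetical DataFrame named df, using notna() to reproduce the comparison that thresh=3 expresses.

```python
import numpy as np
import pandas as pd

# Hypothetical data with 4, 2, and 0 non-NA values per row
df = pd.DataFrame({
    "Name": ["Shark", "Jellyfish", None],
    "ID": [1, 3, np.nan],
    "Population": [100, np.nan, np.nan],
    "Regions": [1, np.nan, np.nan],
})

# Number of non-NA values in each row
non_na_counts = df.notna().sum(axis=1)
print(non_na_counts)

# Rows with at least 3 non-NA values are kept
kept = df.dropna(thresh=3)
print(kept)
```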
Dropping Rows or Columns for Specific subsets
Use the second DataFrame with subset to drop rows with NA values in the Population column:
dfresult = df2.dropna(subset=['Population'])
print(dfresult)
The rows that have NA values in the Population column will be dropped:
Output
    Name ID Population Regions Endangered
0  Shark  1        100       1        NaT
1  Whale  2        200    None        NaT
The third, fourth, and fifth rows were dropped.
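subset also accepts more than one label, and it can be combined with how. The sketch below assumes a small hypothetical DataFrame named df and contrasts the default how='any' with how='all' over the same two columns.

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 1 is missing Regions only, row 2 is missing both
df = pd.DataFrame({
    "Name": ["Shark", "Whale", "Jellyfish"],
    "Population": [100, 200, np.nan],
    "Regions": [1, np.nan, np.nan],
})

# Drop a row if either listed column is NA (default how='any')
any_missing = df.dropna(subset=["Population", "Regions"])

# Drop a row only if both listed columns are NA
all_missing = df.dropna(subset=["Population", "Regions"], how="all")

print(any_missing)
print(all_missing)
```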
You can also specify the index values in the subset when dropping columns from the DataFrame:
dfresult = df2.dropna(subset=[1, 2], axis=1)
print(dfresult)
The columns that contain NA values in the subset of rows 1 and 2 will be dropped:
Output
        Name   ID
0      Shark    1
1      Whale    2
2  Jellyfish    3
3   Starfish    4
4        NaT  NaT
The third, fourth, and fifth columns were dropped.
Changing the Source DataFrame with inplace
By default, dropna() does not modify the source DataFrame. However, in some cases, you may wish to save memory when working with a large source DataFrame by using inplace.
df1.dropna(inplace=True)
print(df1)
This code does not use a dfresult variable.
This will output:
Output
    Name  ID Population Regions
0  Shark   1        100       1
The original DataFrame has been modified.
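Because inplace=True returns None, assigning the result back to a variable is a common mistake. The following minimal sketch uses a small hypothetical DataFrame named df to show the return value.

```python
import numpy as np
import pandas as pd

# Hypothetical one-column DataFrame with a single missing value
df = pd.DataFrame({"a": [1.0, np.nan]})

# With inplace=True the method modifies df and returns None
result = df.dropna(inplace=True)

print(result)
print(len(df))
```

Writing df = df.dropna(inplace=True) would therefore replace the DataFrame with None.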
In this article, you used the dropna() function to remove rows and columns with NA values.
Continue your learning with more Python and pandas tutorials - Python pandas Module Tutorial, pandas Drop Duplicate Rows.