Pandas Data Cleaning
Data cleaning is the process of handling invalid or unusable data in a dataset.
Many datasets contain missing values, incorrectly formatted entries, erroneous values, or duplicate rows. To make data analysis more accurate, this data needs to be cleaned up first.
In this tutorial, we will use the Pandas package for data cleaning.
The test data used in this tutorial is property-data.csv, which contains four types of empty values:
- n/a
- NA
- --
- na
Pandas Cleaning Empty Values
If we want to remove rows that contain empty fields, we can use the dropna() method with the following syntax:
DataFrame.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)
Parameter Description:
- axis: Defaults to 0, which means rows containing empty values are removed. If set to axis=1, columns containing empty values are removed instead.
- how: Defaults to 'any'. If any data in a row (or column) is NA, the entire row is removed. If set to how='all', the entire row is removed only if all data in the row (or column) are NA.
- thresh: Sets the minimum number of non-empty values required to keep the row.
- subset: Specifies the columns to check. For multiple columns, a list of column names can be used as the parameter.
- inplace: If set to True, the operation modifies the source DataFrame in place and returns None.
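As a quick illustration of how these parameters interact, here is a minimal sketch on a small made-up DataFrame (not property-data.csv):
import pandas as pd
import numpy as np
# A small hypothetical DataFrame with some empty values
df = pd.DataFrame({
    "A": [1, np.nan, 3],
    "B": [np.nan, np.nan, 6],
    "C": [7, 8, 9]
})
print(df.dropna())            # how='any' (default): only the last row has no empty values
print(df.dropna(how='all'))   # keeps every row, because no row is entirely empty
print(df.dropna(thresh=2))    # keeps rows with at least 2 non-empty values (rows 0 and 2)
print(df.dropna(axis=1))      # drops columns A and B, keeping only column C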
We can check whether each cell is empty using the isnull() method.
Example
import pandas as pd
df = pd.read_csv('property-data.csv')
print(df['NUM_BEDROOMS'])
print(df['NUM_BEDROOMS'].isnull())
The output of the above example is as follows:
In this example, we see that Pandas treats n/a and NA as empty data, but not na or --, which does not meet our requirements. We can explicitly specify which values should be treated as empty:
Example
import pandas as pd
missing_values = ["n/a", "na", "--"]
df = pd.read_csv('property-data.csv', na_values=missing_values)
print(df['NUM_BEDROOMS'])
print(df['NUM_BEDROOMS'].isnull())
The output of the above example is as follows:
The following example demonstrates removing rows that contain empty data.
Example
import pandas as pd
df = pd.read_csv('property-data.csv')
new_df = df.dropna()
print(new_df.to_string())
The output of the above example is as follows:
Note: By default, the dropna() method returns a new DataFrame and does not modify the source data.
If you want to modify the source data DataFrame, you can use the inplace=True parameter:
Example
import pandas as pd
df = pd.read_csv('property-data.csv')
df.dropna(inplace=True)
print(df.to_string())
The output of the above example is as follows:
We can also remove rows with empty values in specified columns:
Example
Remove rows where the ST_NUM column has empty values:
import pandas as pd
df = pd.read_csv('property-data.csv')
df.dropna(subset=['ST_NUM'], inplace=True)
print(df.to_string())
The output of the above example is as follows:
We can also use the fillna() method to replace empty fields:
Example
Replace empty fields with 12345:
import pandas as pd
df = pd.read_csv('property-data.csv')
df.fillna(12345, inplace=True)
print(df.to_string())
The output of the above example is as follows:
We can also specify a particular column to replace data:
Example
Replace empty data in the PID column with 12345:
import pandas as pd
df = pd.read_csv('property-data.csv')
df['PID'] = df['PID'].fillna(12345)  # assign back instead of using inplace on a single column, which is deprecated in newer pandas versions
print(df.to_string())
The output of the above example is as follows:
A common way to replace empty cells is to fill them with the mean, median, or mode of the column.
Pandas provides the mean(), median(), and mode() methods to calculate the mean (average of all values), median (middle value after sorting), and mode (most frequent value) of a column.
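To see how the three differ, here is a minimal sketch on a small made-up Series (not taken from property-data.csv):
import pandas as pd
s = pd.Series([1, 2, 2, 10])
print(s.mean())       # 3.75 - average of all values
print(s.median())     # 2.0  - middle value after sorting
print(s.mode()[0])    # 2    - most frequent value; mode() returns a Series, so take the first entry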
Example
Use the mean() method to calculate the mean of the column and replace empty cells:
import pandas as pd
df = pd.read_csv('property-data.csv')
x = df["ST_NUM"].mean()
df["ST_NUM"] = df["ST_NUM"].fillna(x)  # assign back instead of filling the column in place
print(df.to_string())
The above example outputs the following result, where the calculated mean has replaced the empty cells in the ST_NUM column:
Example
Use the median() method to calculate the median of the column and replace empty cells:
import pandas as pd
df = pd.read_csv('property-data.csv')
x = df["ST_NUM"].median()
df["ST_NUM"] = df["ST_NUM"].fillna(x)  # assign back instead of filling the column in place
print(df.to_string())
The above example outputs the following result, where the calculated median has replaced the empty cells in the ST_NUM column:
Example
Use the mode() method to calculate the mode of the column and replace empty cells:
import pandas as pd
df = pd.read_csv('property-data.csv')
x = df["ST_NUM"].mode()[0]  # mode() returns a Series, so take the first value
df["ST_NUM"] = df["ST_NUM"].fillna(x)
print(df.to_string())
The above example outputs the following result, where the calculated mode has replaced the empty cells in the ST_NUM column:
Pandas Cleaning Incorrect Format Data
Cells with incorrect data formats can make data analysis difficult or even impossible.
To handle this, we can either remove the rows containing badly formatted cells or convert all cells in the column to the same format.
The following example converts the Date column to a uniform date format:
Example
import pandas as pd
# The third date is in the wrong format
data = {
"Date": ['2020/12/01', '2020/12/02', '20201226'],
"duration": [50, 40, 45]
}
df = pd.DataFrame(data, index=["day1", "day2", "day3"])
df['Date'] = pd.to_datetime(df['Date'], format='mixed')  # format='mixed' parses each value individually; required for mixed date formats in pandas 2.0+
print(df.to_string())
The above example outputs the following result:
Date duration
day1 2020-12-01 50
day2 2020-12-02 40
day3 2020-12-26 45
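If a date cannot be converted at all, one common approach (a sketch, not part of the original example) is to let to_datetime() turn it into NaT with errors='coerce' and then drop those rows with dropna():
import pandas as pd
data = {
    "Date": ['2020/12/01', '2020/12/02', 'not a date'],   # hypothetical bad value
    "duration": [50, 40, 45]
}
df = pd.DataFrame(data, index=["day1", "day2", "day3"])
# Unparseable dates become NaT instead of raising an error
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')
# Remove the rows whose Date could not be parsed
df.dropna(subset=['Date'], inplace=True)
print(df.to_string())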
Pandas Cleaning Incorrect Data
Incorrect data is also common, and we can replace or remove such data.
The following example replaces incorrect age data:
Example
import pandas as pd
person = {
"name": ['Google', 'tutorialpro', 'Taobao'],
"age": [50, 40, 12345] # 12345 is incorrect age data
}
df = pd.DataFrame(person)
df.loc[2, 'age'] = 30 # Correct the incorrect value in row index 2
print(df.to_string())
The above example outputs the following result:
name age
0 Google 50
1 tutorialpro 40
2 Taobao 30
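If you already know which value is wrong, the replace() method is an alternative to indexing with loc (a minimal sketch using the same data):
import pandas as pd
person = {
    "name": ['Google', 'tutorialpro', 'Taobao'],
    "age": [50, 40, 12345]
}
df = pd.DataFrame(person)
# Replace the known incorrect value directly
df['age'] = df['age'].replace(12345, 30)
print(df.to_string())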
You can also set conditional statements:
Example
Set age greater than 120 to 120:
import pandas as pd
person = {
"name": ['Google', 'tutorialpro', 'Taobao'],
"age": [50, 200, 12345]
}
df = pd.DataFrame(person)
for x in df.index:
    if df.loc[x, "age"] > 120:
        df.loc[x, "age"] = 120
print(df.to_string())
The above example outputs the following result:
name age
0 Google 50
1 tutorialpro 120
2 Taobao 120
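The same result can be achieved without an explicit loop, for example with boolean indexing (a sketch using the same data):
import pandas as pd
person = {
    "name": ['Google', 'tutorialpro', 'Taobao'],
    "age": [50, 200, 12345]
}
df = pd.DataFrame(person)
# Set every age above 120 to 120 in one step
df.loc[df['age'] > 120, 'age'] = 120
# Equivalent alternative: df['age'] = df['age'].clip(upper=120)
print(df.to_string())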
You can also delete rows with incorrect data:
Example
Delete rows with age greater than 120:
import pandas as pd
person = {
"name": ['Google', 'tutorialpro', 'Taobao'],
"age": [50, 200, 12345]
}
df = pd.DataFrame(person)
df = df[df['age'] <= 120]
print(df.to_string())
The above example outputs the following result:
name age
0 Google 50
Alternatively, you can loop through the rows and use the drop() method to remove the rows with incorrect data:
Example
import pandas as pd
person = {
"name": ['Google', 'tutorialpro', 'Taobao'],
"age": [50, 40, 12345] # 12345 is incorrect age data
}
df = pd.DataFrame(person)
for x in df.index:
    if df.loc[x, "age"] > 120:
        df.drop(x, inplace=True)
print(df.to_string())
The above example outputs the following result:
name age
0 Google 50
1 tutorialpro 40
Cleaning Duplicate Data in Pandas
To clean duplicate data, you can use the duplicated() and drop_duplicates() methods.
The duplicated() method returns True for rows that are duplicates of an earlier row, and False otherwise.
Example
import pandas as pd
person = {
"name": ['Google', 'tutorialpro', 'tutorialpro', 'Taobao'],
"age": [50, 40, 40, 23]
}
df = pd.DataFrame(person)
print(df.duplicated())
The above example outputs the following result:
0 False
1 False
2 True
3 False
dtype: bool
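The duplicated() method also accepts a subset parameter if only certain columns should be checked for duplicates (a minimal sketch using the same data):
import pandas as pd
person = {
    "name": ['Google', 'tutorialpro', 'tutorialpro', 'Taobao'],
    "age": [50, 40, 40, 23]
}
df = pd.DataFrame(person)
# Only the name column is compared when looking for duplicates
print(df.duplicated(subset=['name']))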
To remove duplicate data, you can directly use the drop_duplicates() method.
Example
import pandas as pd
persons = {
"name": ['Google', 'tutorialpro', 'tutorialpro', 'Taobao'],
"age": [50, 40, 40, 23]
}
df = pd.DataFrame(persons)
df.drop_duplicates(inplace = True)
print(df)
The above example outputs the following result:
name age
0 Google 50
1 tutorialpro 40
3 Taobao 23
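By default, drop_duplicates() keeps the first occurrence of each duplicated row; the keep parameter controls this behavior (a minimal sketch using the same data):
import pandas as pd
persons = {
    "name": ['Google', 'tutorialpro', 'tutorialpro', 'Taobao'],
    "age": [50, 40, 40, 23]
}
df = pd.DataFrame(persons)
# keep='last' keeps the last occurrence instead of the first
print(df.drop_duplicates(keep='last'))
# keep=False removes every row that has a duplicate
print(df.drop_duplicates(keep=False))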