Pandas Data Cleaning
Data cleaning is the process of handling invalid or unusable data in a dataset.
Many datasets contain missing values, incorrectly formatted entries, erroneous values, or duplicate rows. To make data analysis more accurate, this data needs to be cleaned up first.
In this tutorial, we will use the Pandas package for data cleaning.
The test data used in this tutorial is property-data.csv, which contains four types of empty values:
- n/a
- NA
- --
- na
Pandas Cleaning Empty Values
If we want to remove rows that contain empty fields, we can use the dropna() method with the following syntax:
DataFrame.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)
Parameter Description:
- axis: Defaults to 0, which means rows containing empty values are removed. If set to axis=1, columns containing empty values are removed instead.
- how: Defaults to 'any'. If any data in a row (or column) is NA, the entire row is removed. If set to how='all', the entire row is removed only if all data in the row (or column) are NA.
- thresh: Sets the minimum number of non-empty values required to keep the row.
- subset: Specifies the columns to check. For multiple columns, a list of column names can be used as the parameter.
- inplace: If set to True, the operation modifies the source DataFrame in place and returns None.
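As a quick illustration of how these parameters interact, here is a minimal sketch on a small made-up DataFrame (not property-data.csv):
import pandas as pd
import numpy as np
# A small hypothetical DataFrame with some empty values
df = pd.DataFrame({
    "A": [1, np.nan, 3],
    "B": [np.nan, np.nan, 6],
    "C": [7, 8, 9]
})
print(df.dropna())            # how='any' (default): only the last row has no empty values
print(df.dropna(how='all'))   # keeps every row, because no row is entirely empty
print(df.dropna(thresh=2))    # keeps rows with at least 2 non-empty values (rows 0 and 2)
print(df.dropna(axis=1))      # drops columns A and B, keeping only column C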
We can check whether each cell is empty using the isnull() method.
Example
import pandas as pd
df = pd.read_csv('property-data.csv')
print(df['NUM_BEDROOMS'])
print(df['NUM_BEDROOMS'].isnull())
The output of the above example is as follows:
In this example, we see that Pandas treats n/a and NA as empty data, but not na or --, which does not meet our requirements. We can explicitly specify which values should be treated as empty:
Example
import pandas as pd
missing_values = ["n/a", "na", "--"]
df = pd.read_csv('property-data.csv', na_values=missing_values)
print(df['NUM_BEDROOMS'])
print(df['NUM_BEDROOMS'].isnull())
The output of the above example is as follows:
The following example demonstrates removing rows that contain empty data.
Example
import pandas as pd
df = pd.read_csv('property-data.csv')
new_df = df.dropna()
print(new_df.to_string())
The output of the above example is as follows:
Note: By default, the dropna() method returns a new DataFrame and does not modify the source data.
If you want to modify the source data DataFrame, you can use the inplace=True parameter:
Example
import pandas as pd
df = pd.read_csv('property-data.csv')
df.dropna(inplace=True)
print(df.to_string())
The output of the above example is as follows:
We can also remove rows with empty values in specified columns:
Example
Remove rows where the ST_NUM column has empty values:
import pandas as pd
df = pd.read_csv('property-data.csv')
df.dropna(subset=['ST_NUM'], inplace=True)
print(df.to_string())
The output of the above example is as follows:
We can also use the fillna() method to replace empty fields:
Example
Replace empty fields with 12345:
import pandas as pd
df = pd.read_csv('property-data.csv')
df.fillna(12345, inplace=True)
print(df.to_string())
The output of the above example is as follows:
We can also specify a particular column to replace data:
Example
Replace empty data in the PID column with 12345:
import pandas as pd
df = pd.read_csv('property-data.csv')
df['PID'] = df['PID'].fillna(12345)  # assign back instead of using inplace on a single column, which is deprecated in newer pandas versions
print(df.to_string())
The output of the above example is as follows:
A common way to replace empty cells is to fill them with the mean, median, or mode of the column.
Pandas provides the mean(), median(), and mode() methods to calculate the mean (average of all values), median (middle value after sorting), and mode (most frequent value) of a column.
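To see how the three differ, here is a minimal sketch on a small made-up Series (not taken from property-data.csv):
import pandas as pd
s = pd.Series([1, 2, 2, 10])
print(s.mean())       # 3.75 - average of all values
print(s.median())     # 2.0  - middle value after sorting
print(s.mode()[0])    # 2    - most frequent value; mode() returns a Series, so take the first entry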
Example
Use the mean() method to calculate the mean of the column and replace empty cells:
import pandas as pd
df = pd.read_csv('property-data.csv')
x = df["ST_NUM"].mean()
df["ST_NUM"] = df["ST_NUM"].fillna(x)  # assign back instead of filling the column in place
print(df.to_string())
The above example outputs the following result, where the calculated mean has replaced the empty cells in the ST_NUM column:
Example
Use the median() method to calculate the median of the column and replace empty cells:
import pandas as pd
df = pd.read_csv('property-data.csv')
x = df["ST_NUM"].median()
df["ST_NUM"] = df["ST_NUM"].fillna(x)  # assign back instead of filling the column in place
print(df.to_string())
The above example outputs the following result, where the calculated median has replaced the empty cells in the ST_NUM column:
Example
Use the mode() method to calculate the mode of the column and replace empty cells:
import pandas as pd
df = pd.read_csv('property-data.csv')
x = df["ST_NUM"].mode()[0]  # mode() returns a Series, so take the first value
df["ST_NUM"] = df["ST_NUM"].fillna(x)
print(df.to_string())
The above example outputs the following result, where the calculated mode has replaced the empty cells in the ST_NUM column:
Pandas Cleaning Incorrect Format Data
Cells with incorrect data formats can make data analysis difficult or even impossible.
To handle this, we can either remove the rows containing badly formatted cells or convert all cells in the column to the same format.
The following example converts the Date column to a uniform date format:
Example
import pandas as pd
# The third date is in the wrong format
data = {
"Date": ['2020/12/01', '2020/12/02', '20201226'],
"duration": [50, 40, 45]
}
df = pd.DataFrame(data, index=["day1", "day2", "day3"])
df['Date'] = pd.to_datetime(df['Date'], format='mixed')  # format='mixed' parses each value individually; required for mixed date formats in pandas 2.0+
print(df.to_string())
The above example outputs the following result:
Date duration
day1 2020-12-01 50
day2 2020-12-02 40
day3 2020-12-26 45
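If a date cannot be converted at all, one common approach (a sketch, not part of the original example) is to let to_datetime() turn it into NaT with errors='coerce' and then drop those rows with dropna():
import pandas as pd
data = {
    "Date": ['2020/12/01', '2020/12/02', 'not a date'],   # hypothetical bad value
    "duration": [50, 40, 45]
}
df = pd.DataFrame(data, index=["day1", "day2", "day3"])
# Unparseable dates become NaT instead of raising an error
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')
# Remove the rows whose Date could not be parsed
df.dropna(subset=['Date'], inplace=True)
print(df.to_string())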
Pandas Cleaning Incorrect Data
Incorrect data is also common, and we can replace or remove such data.
The following example replaces incorrect age data:
Example
import pandas as pd
person = {
"name": ['Google', 'tutorialpro', 'Taobao'],
"age": [50, 40, 12345] # 12345 is incorrect age data
}
df = pd.DataFrame(person)
df.loc[2, 'age'] = 30 # Correct the incorrect value in row index 2
print(df.to_string())
The above example outputs the following result:
name age
0 Google 50
1 tutorialpro 40
2 Taobao 30
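If you already know which value is wrong, the replace() method is an alternative to indexing with loc (a minimal sketch using the same data):
import pandas as pd
person = {
    "name": ['Google', 'tutorialpro', 'Taobao'],
    "age": [50, 40, 12345]
}
df = pd.DataFrame(person)
# Replace the known incorrect value directly
df['age'] = df['age'].replace(12345, 30)
print(df.to_string())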
You can also set conditional statements:
Example
Set age greater than 120 to 120:
import pandas as pd
person = {
"name": ['Google', 'tutorialpro', 'Taobao'],
"age": [50, 200, 12345]
}
df = pd.DataFrame(person)
for x in df.index:
    if df.loc[x, "age"] > 120:
        df.loc[x, "age"] = 120
print(df.to_string())
The above example outputs the following result:
name age
0 Google 50
1 tutorialpro 120
2 Taobao 120
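The same result can be achieved without an explicit loop, for example with boolean indexing (a sketch using the same data):
import pandas as pd
person = {
    "name": ['Google', 'tutorialpro', 'Taobao'],
    "age": [50, 200, 12345]
}
df = pd.DataFrame(person)
# Set every age above 120 to 120 in one step
df.loc[df['age'] > 120, 'age'] = 120
# Equivalent alternative: df['age'] = df['age'].clip(upper=120)
print(df.to_string())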
You can also delete rows with incorrect data:
Example
Delete rows with age greater than 120:
import pandas as pd
person = {
"name": ['Google', 'tutorialpro', 'Taobao'],
"age": [50, 200, 12345]
}
df = pd.DataFrame(person)
df = df[df['age'] <= 120]
print(df.to_string())
The above example outputs the following result:
name age
0 Google 50
Alternatively, you can loop through the rows and use the drop() method to remove the rows with incorrect data:
Example
import pandas as pd
person = {
"name": ['Google', 'tutorialpro', 'Taobao'],
"age": [50, 40, 12345] # 12345 is incorrect age data
}
df = pd.DataFrame(person)
for x in df.index:
    if df.loc[x, "age"] > 120:
        df.drop(x, inplace=True)
print(df.to_string())
The above example outputs the following result:
name age
0 Google 50
1 tutorialpro 40
Cleaning Duplicate Data in Pandas
To clean duplicate data, you can use the duplicated() and drop_duplicates() methods.
The duplicated() method returns True for rows that are duplicates of an earlier row, and False otherwise.
Example
import pandas as pd
person = {
"name": ['Google', 'tutorialpro', 'tutorialpro', 'Taobao'],
"age": [50, 40, 40, 23]
}
df = pd.DataFrame(person)
print(df.duplicated())
The above example outputs the following result:
0 False
1 False
2 True
3 False
dtype: bool
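The duplicated() method also accepts a subset parameter if only certain columns should be checked for duplicates (a minimal sketch using the same data):
import pandas as pd
person = {
    "name": ['Google', 'tutorialpro', 'tutorialpro', 'Taobao'],
    "age": [50, 40, 40, 23]
}
df = pd.DataFrame(person)
# Only the name column is compared when looking for duplicates
print(df.duplicated(subset=['name']))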
To remove duplicate data, you can directly use the drop_duplicates() method.
Example
import pandas as pd
persons = {
"name": ['Google', 'tutorialpro', 'tutorialpro', 'Taobao'],
"age": [50, 40, 40, 23]
}
df = pd.DataFrame(persons)
df.drop_duplicates(inplace = True)
print(df)
The above example outputs the following result:
name age
0 Google 50
1 tutorialpro 40
3 Taobao 23
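By default, drop_duplicates() keeps the first occurrence of each duplicated row; the keep parameter controls this behavior (a minimal sketch using the same data):
import pandas as pd
persons = {
    "name": ['Google', 'tutorialpro', 'tutorialpro', 'Taobao'],
    "age": [50, 40, 40, 23]
}
df = pd.DataFrame(persons)
# keep='last' keeps the last occurrence instead of the first
print(df.drop_duplicates(keep='last'))
# keep=False removes every row that has a duplicate
print(df.drop_duplicates(keep=False))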