
Common Data Cleaning Steps in Pandas

Learn about data cleaning and preprocessing in pandas, including common steps like handling missing values and converting data types.

Data Cleaning in Pandas


Data cleaning is an essential step in the data preprocessing stage of any data analysis project. In pandas, there are several methods available to clean and preprocess data.

Here's an example of data cleaning in pandas:

Suppose we have a dataset containing information about customers of a retail store. The dataset contains the following columns: customer_id, name, age, gender, email, address, and purchase_amount. Let's assume the dataset has missing values, duplicates, and inconsistent data.

Loading the data:

First, we need to load the data into a pandas dataframe:

import pandas as pd
df = pd.read_csv('customer_data.csv')
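Before cleaning anything, it helps to look at what was loaded. The following is a minimal sketch that reads a tiny in-memory CSV in place of customer_data.csv; the two sample rows are made up for illustration, and only the column names come from the dataset described above.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for customer_data.csv
csv_text = """customer_id,name,age,gender,email,address,purchase_amount
1,Alice,34,F,alice@example.com,1 Main St,50.0
2,Bob,,M,bob@example.com,2 Oak Ave,60.5
"""

sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)    # (rows, columns)
print(sample.dtypes)   # inferred data type of each column
sample.info()          # dtypes plus non-null counts per column
```

The non-null counts from info() give an early hint about which columns have missing values.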

Handling inconsistent formatting:

Inconsistent formatting of data can make it difficult to analyze. We can use string methods to clean and standardize text data.
For example, to standardize the 'email' column to lowercase:

df['email'] = df['email'].str.lower()
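Text columns often carry stray whitespace as well as mixed case. Here is a small sketch of chaining string methods, using a made-up Series in place of the real 'email' column (the sample values are assumptions):

```python
import pandas as pd

# Hypothetical sample values standing in for df['email']
emails = pd.Series([' Alice@Example.COM', 'bob@example.com ', None])

# Strip surrounding whitespace, then lowercase; missing values stay missing
clean_emails = emails.str.strip().str.lower()
print(clean_emails)
```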

Handling incorrect data types:

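Sometimes, data may be in the wrong data type, such as a string instead of a numeric value. We can use the astype() method to convert the data type, but astype(int) raises an error if the column contains missing or non-numeric values. Since this dataset has missing values, pd.to_numeric() with errors='coerce' is safer: it converts what it can and turns everything else into NaN.
For example, to convert the 'age' column from object to a numeric data type (once the missing values have been handled, astype(int) can convert it to integers):

df['age'] = pd.to_numeric(df['age'], errors='coerce')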

Handling missing values:

We can check for missing values in the dataframe using the isnull() method. If there are missing values, we can handle them in several ways, such as filling them with the mean or median value of the column, dropping the rows or columns with missing values, or filling them with a default value.
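As a quick illustration of the check itself, the sketch below counts missing values per column. The sample frame is hypothetical and only stands in for the customer dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows standing in for the customer dataset
sample = pd.DataFrame({
    'age': [34, np.nan, 51, np.nan],
    'email': ['a@example.com', None, 'c@example.com', 'd@example.com'],
})

# Number of missing values in each column
missing_counts = sample.isnull().sum()
print(missing_counts)

# Share of missing values in each column, as a fraction between 0 and 1
missing_share = sample.isnull().mean()
print(missing_share)
```

The share of missing values is often what decides between filling a column and dropping it.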

For example, to fill missing values with the mean value of the column 'age':

mean_age = df['age'].mean()
df['age'] = df['age'].fillna(mean_age)

To drop rows with missing values:

df.dropna(inplace=True)

Handling duplicates:

We can check for duplicates in the dataframe using the duplicated() method, which flags every row that repeats an earlier one. If there are duplicates, we can drop them with drop_duplicates(), which keeps the first occurrence of each row by default.

For example, to drop duplicates:

df.drop_duplicates(inplace=True)
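To see how many duplicates there are before dropping them, or to treat rows as duplicates based on selected columns only, something like the following sketch can be used. The sample rows are made up, and using customer_id as the deduplication key is an assumption.

```python
import pandas as pd

# Hypothetical sample rows; customer 2 appears twice
sample = pd.DataFrame({
    'customer_id': [1, 2, 2, 3],
    'email': ['a@example.com', 'b@example.com', 'b@example.com', 'c@example.com'],
})

# duplicated() marks each repeat of an earlier row as True
n_duplicates = sample.duplicated().sum()
print(n_duplicates)

# Compare rows on customer_id only, and keep the last occurrence
deduped = sample.drop_duplicates(subset=['customer_id'], keep='last')
print(deduped)
```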

Handling inconsistent data:

We can check for inconsistent data in the dataframe using various methods, such as checking for outliers, incorrect data types, or inconsistent formatting.

For example, to check if the 'gender' column contains inconsistent data:

unique_genders = df['gender'].unique()
print(unique_genders)

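If the output shows inconsistent gender values, we can handle them by replacing them with a standard format. Here replace() is used rather than map(), because map() would turn any value not listed in the dictionary (such as an already correct 'Male') into NaN:

df['gender'] = df['gender'].replace({'M': 'Male', 'F': 'Female'})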
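When a column has many spelling variants, normalizing case and whitespace first keeps the dictionary small. The sketch below uses made-up values and a hypothetical mapping. It relies on the fact that map() returns NaN for any value that is not a key in the dictionary, which makes the leftover values easy to list.

```python
import pandas as pd

# Hypothetical sample values standing in for df['gender']
genders = pd.Series(['M', 'f', ' Male', 'FEMALE', 'x'])

# Normalize case and whitespace so fewer variants need mapping
normalized = genders.str.strip().str.lower()

# Hypothetical mapping; extend it to cover whatever unique() reports
mapping = {'m': 'Male', 'male': 'Male', 'f': 'Female', 'female': 'Female'}
standardized = normalized.map(mapping)

# Values missing from the mapping became NaN; list the originals for review
unmapped = genders[standardized.isna()]
print(unmapped.tolist())
```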

Removing outliers:

Outliers are extreme values that can skew the analysis. We can identify and remove outliers using various methods such as box plots, scatter plots, or z-score.
For example, to remove outliers from the 'purchase_amount' column using the interquartile range (IQR) method:

Q1 = df['purchase_amount'].quantile(0.25)
Q3 = df['purchase_amount'].quantile(0.75)
IQR = Q3 - Q1
# Keep rows on or inside the fences; only values strictly outside them are outliers
df = df[df['purchase_amount'].between(Q1 - 1.5 * IQR, Q3 + 1.5 * IQR)]
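The z-score approach mentioned above can be sketched in a similar way. The sample amounts are made up, and the cutoff of 3 standard deviations is a common convention assumed here, not a fixed rule.

```python
import pandas as pd

# Hypothetical purchase amounts: 19 ordinary values and one extreme value
amounts = pd.Series(list(range(20, 39)) + [500], dtype=float)

# z-score: distance from the mean, measured in standard deviations
z_scores = (amounts - amounts.mean()) / amounts.std()

# Keep values within 3 standard deviations of the mean (assumed threshold)
filtered = amounts[z_scores.abs() <= 3]
print(len(amounts), len(filtered))
```

With very few rows, a single extreme value inflates the standard deviation so much that no z-score can reach 3, so this method is better suited to larger samples.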

Data cleaning is an iterative process, and we may need to perform multiple rounds of cleaning to ensure the data is ready for analysis.
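Because cleaning is usually repeated, it can help to collect the steps into one function that is re-run whenever the data changes. This is a rough sketch only: the function name clean_customers, the sample rows, and the order of the steps are assumptions made for illustration, and only the column names come from the example dataset.

```python
import pandas as pd

def clean_customers(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the cleaning steps in one pass and return a new DataFrame."""
    out = df.copy()
    # Standardize text and convert age to numeric (bad entries become NaN)
    out['email'] = out['email'].str.strip().str.lower()
    out['age'] = pd.to_numeric(out['age'], errors='coerce')
    # Drop exact duplicate rows, then fill missing ages with the mean
    out = out.drop_duplicates()
    out['age'] = out['age'].fillna(out['age'].mean())
    # Replace abbreviations; values not listed are left unchanged
    out['gender'] = out['gender'].replace({'M': 'Male', 'F': 'Female'})
    # Keep purchase amounts on or inside the IQR fences
    q1 = out['purchase_amount'].quantile(0.25)
    q3 = out['purchase_amount'].quantile(0.75)
    iqr = q3 - q1
    in_range = out['purchase_amount'].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    return out[in_range].reset_index(drop=True)

# Hypothetical raw data with a duplicate, missing ages, and an outlier
raw = pd.DataFrame({
    'customer_id': [1, 2, 2, 3, 4, 5],
    'email': ['A@Example.com', 'b@example.com', 'b@example.com',
              'c@example.com', 'd@example.com', 'e@example.com'],
    'age': ['34', '41', '41', None, '29', 'unknown'],
    'gender': ['M', 'F', 'F', 'Female', 'M', 'Male'],
    'purchase_amount': [50.0, 60.0, 60.0, 55.0, 65.0, 5000.0],
})

cleaned = clean_customers(raw)
print(cleaned)
```

Working on a copy leaves the raw data untouched, so each round of cleaning can start again from the same input.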
