Data Cleaning in Pandas

Data cleaning is an essential step in the preprocessing stage of any data analysis project. pandas provides methods for the most common cleaning tasks: standardizing text, converting data types, handling missing values, removing duplicates, and filtering outliers.

Here's an example of data cleaning in pandas:

Suppose we have a dataset containing information about customers of a retail store. The dataset contains the following columns: customer_id, name, age, gender, email, address, and purchase_amount. Let's assume the dataset has missing values, duplicates, and inconsistent data.

Loading the data:

First, we need to load the data into a pandas dataframe:

import pandas as pd
df = pd.read_csv('customer_data.csv')
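
Before cleaning anything, it helps to look at what was actually loaded. The following is a minimal sketch, assuming a small in-memory sample in place of customer_data.csv (the rows are made up for illustration):

```python
import io
import pandas as pd

# Hypothetical sample standing in for customer_data.csv
csv_text = """customer_id,name,age,gender,email,address,purchase_amount
1,Alice,34,F,Alice@Example.com,12 Oak St,120.50
2,Bob,,M,bob@example.com,9 Elm Rd,80.00
2,Bob,,M,bob@example.com,9 Elm Rd,80.00
"""
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)    # (number of rows, number of columns)
print(df.dtypes)   # data type pandas inferred for each column
print(df.head())   # first few rows
df.info()          # non-null counts per column, a quick hint at missing values
```

The dtypes and non-null counts usually point to the columns that need attention first.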

Handling inconsistent formatting:

Inconsistent formatting of data can make it difficult to analyze. We can use string methods to clean and standardize text data.
For example, to standardize the 'email' column to lowercase:

df['email'] = df['email'].str.lower()
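
Stray whitespace is another common formatting problem. As a rough sketch on made-up sample values, string methods can be chained so that text is trimmed and its case standardized in one step:

```python
import pandas as pd

# Hypothetical sample values with mixed case and stray spaces
df = pd.DataFrame({
    'name': ['  alice SMITH ', 'Bob jones', None],
    'email': [' Alice@Example.COM', 'bob@example.com ', None],
})

# Trim surrounding whitespace, then standardize the case
df['email'] = df['email'].str.strip().str.lower()
df['name'] = df['name'].str.strip().str.title()

print(df)
```

Missing entries stay missing after these calls, so they can still be handled in a later step.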

Handling incorrect data types:

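Sometimes a column is stored with the wrong data type, such as numbers stored as strings. The astype() method converts a column to another type, but astype(int) raises an error if the column contains missing values or non-numeric text. Because this dataset has missing values, pd.to_numeric() with errors='coerce' is the safer choice: it converts what it can and turns everything else into NaN.
For example, to convert the 'age' column from object to a numeric data type:

df['age'] = pd.to_numeric(df['age'], errors='coerce')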
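
When a column mixes numbers, unparseable text, and missing entries, a plain integer conversion fails. Here is a minimal sketch on made-up values, assuming it is acceptable to treat unparseable entries as missing. The nullable 'Int64' dtype is optional and keeps whole numbers alongside missing values:

```python
import pandas as pd

# Hypothetical 'age' values as they might arrive from a CSV file
df = pd.DataFrame({'age': ['34', '27', 'unknown', None]})

# errors='coerce' turns anything that cannot be parsed into NaN instead of raising
df['age'] = pd.to_numeric(df['age'], errors='coerce')

# Optional: nullable integer dtype, so ages are integers and gaps stay missing
df['age'] = df['age'].astype('Int64')

print(df['age'])
```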

Handling missing values:

We can check for missing values in the dataframe using the isnull() method. If there are missing values, we can handle them in several ways, such as filling them with the mean or median value of the column, dropping the rows or columns with missing values, or filling them with a default value.
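
Counting the missing values per column helps decide between filling and dropping. A small sketch on made-up data:

```python
import numpy as np
import pandas as pd

# Hypothetical sample with gaps in both columns
df = pd.DataFrame({
    'age': [34, np.nan, 27, np.nan],
    'email': ['a@example.com', None, 'c@example.com', 'd@example.com'],
})

missing_counts = df.isnull().sum()   # number of missing values per column
missing_share = df.isnull().mean()   # fraction of rows missing per column

print(missing_counts)
print(missing_share)
```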

For example, to fill missing values with the mean value of the column 'age':

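mean_age = df['age'].mean()
# Assign the result back; fillna(inplace=True) on a selected column may not update df in recent pandas versions
df['age'] = df['age'].fillna(mean_age)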

To drop every row that has a missing value in any column:

df.dropna(inplace=True)
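
Dropping every row with any missing value can discard a lot of data. As an illustrative sketch on made-up rows, the median can be used as a fill value that is less sensitive to extreme ages, and dropna() can be limited to the columns that must be present (treating 'email' as required is an assumption here):

```python
import numpy as np
import pandas as pd

# Hypothetical sample: one missing age, one missing email
df = pd.DataFrame({
    'customer_id': [1, 2, 3, 4],
    'age': [30.0, np.nan, 40.0, 95.0],
    'email': ['a@example.com', 'b@example.com', None, 'd@example.com'],
})

# Fill missing ages with the median of the known ages
df['age'] = df['age'].fillna(df['age'].median())

# Drop a row only when its 'email' is missing
df = df.dropna(subset=['email'])

print(df)
```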

Handling duplicates:

We can check for duplicates in the dataframe using the duplicated() method, which marks every row that repeats an earlier row. If there are duplicates, drop_duplicates() removes them and, by default, keeps the first occurrence of each.

For example, to drop duplicate rows:

df.drop_duplicates(inplace=True)
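
Rows can also be duplicates in one key column while differing elsewhere. The sketch below uses made-up rows and assumes customer_id should be unique. It counts both kinds of duplicate, then keeps the first row per customer:

```python
import pandas as pd

# Hypothetical sample: customer_id 2 appears twice with different amounts
df = pd.DataFrame({
    'customer_id': [1, 2, 2, 3],
    'email': ['a@example.com', 'b@example.com', 'b@example.com', 'c@example.com'],
    'purchase_amount': [10.0, 20.0, 25.0, 30.0],
})

n_exact = df.duplicated().sum()                        # rows identical in every column
n_by_id = df.duplicated(subset=['customer_id']).sum()  # rows repeating a customer_id

# Keep the first row seen for each customer_id
deduped = df.drop_duplicates(subset=['customer_id'], keep='first')

print(n_exact, n_by_id)
print(deduped)
```

Whether the first or the last row is the right one to keep depends on the data, so keep='first' is only one possible choice.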

Handling inconsistent data:

We can check for inconsistent data in the dataframe using various methods, such as checking for outliers, incorrect data types, or inconsistent formatting.

For example, to check if the 'gender' column contains inconsistent data:

unique_genders = df['gender'].unique()
print(unique_genders)

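If the output shows inconsistent gender values, we can handle them by replacing them with a standard format. The replace() method changes only the listed values and leaves everything else untouched, whereas map() with a dictionary turns every value that is not in the dictionary (including an already correct 'Male') into NaN:

df['gender'] = df['gender'].replace({'M': 'Male', 'F': 'Female'})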
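
Real columns often contain more variants than 'M' and 'F'. The following is a rough sketch, with a hypothetical mapping and made-up values. It normalizes case and whitespace first, then uses the NaN that map() produces for unknown keys to list values that still need a decision:

```python
import pandas as pd

# Hypothetical raw values with several spellings, a gap, and an unexpected code
df = pd.DataFrame({'gender': ['M', 'male', ' F ', 'Female', 'f', None, 'X']})

# Hypothetical mapping; extend it with the variants found via unique()
gender_map = {'m': 'Male', 'male': 'Male', 'f': 'Female', 'female': 'Female'}

normalized = df['gender'].str.strip().str.lower()

# map() returns NaN for keys that are not in the mapping
df['gender_clean'] = normalized.map(gender_map)

# Values that were present in the raw data but could not be mapped
unmapped = df.loc[df['gender_clean'].isna() & df['gender'].notna(), 'gender']
print(unmapped.tolist())
```

Writing to a separate 'gender_clean' column keeps the raw values available for review.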

Removing outliers:

Outliers are extreme values that can skew the analysis. We can identify and remove outliers using various methods such as box plots, scatter plots, or z-score.
For example, to remove outliers from the 'purchase_amount' column using the interquartile range (IQR) method:

Q1 = df['purchase_amount'].quantile(0.25)
Q3 = df['purchase_amount'].quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR
# between() includes both bounds, so no rows are lost when IQR is 0
df = df[df['purchase_amount'].between(lower, upper)]
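
The z-score mentioned above is an alternative to the IQR rule. Here is a minimal sketch on made-up amounts, assuming the data is roughly bell-shaped and using the common (but arbitrary) cutoff of 3 standard deviations:

```python
import pandas as pd

# Hypothetical amounts: twenty ordinary purchases and one extreme value
amounts = [float(x) for x in range(40, 60)] + [5000.0]
df = pd.DataFrame({'purchase_amount': amounts})

mean = df['purchase_amount'].mean()
std = df['purchase_amount'].std()

# z-score: distance from the mean measured in standard deviations
z = (df['purchase_amount'] - mean) / std

# Keep rows within 3 standard deviations of the mean
filtered = df[z.abs() <= 3]

print(len(df), len(filtered))
```

In very small samples a single extreme value inflates the standard deviation so much that its own z-score may never exceed 3, which is one reason the IQR method is often preferred for small or skewed data.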

Data cleaning is an iterative process, and we may need to perform multiple rounds of cleaning to ensure the data is ready for analysis.

Thank you so much for reading this article on common data cleaning steps in pandas.

Thabresh Syed - Data Science Daily
