Data Cleaning in Pandas
Data cleaning is an essential step in the preprocessing stage of any data analysis project. pandas provides methods for the most common cleaning tasks: standardizing formatting, fixing data types, handling missing values, removing duplicates, and filtering outliers.
Here's an example of data cleaning in pandas:
Suppose we have a dataset containing information about customers of a retail store. The dataset contains the following columns: customer_id, name, age, gender, email, address, and purchase_amount. Let's assume the dataset has missing values, duplicates, and inconsistent data.
Loading the data:
First, we need to load the data into a pandas dataframe:
import pandas as pd
df = pd.read_csv('customer_data.csv')
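Before cleaning anything, it helps to look at what was loaded. The sketch below is only an illustration: it reads a few made-up rows from an in-memory string instead of customer_data.csv, so the values and the name `sample` are assumptions, not the real dataset.

```python
import io
import pandas as pd

# Hypothetical sample rows standing in for customer_data.csv
csv_text = """customer_id,name,age,gender,email,address,purchase_amount
1,Alice,34,F,Alice@Example.com,12 Oak St,120.50
2,Bob,,M,bob@example.com,,89.99
2,Bob,,M,bob@example.com,,89.99
"""

sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)           # number of rows and columns
print(sample.dtypes)          # inferred type of each column
print(sample.isnull().sum())  # missing values per column
```

The same three calls work on the real dataframe and give a quick picture of which columns need attention.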
Handling inconsistent formatting:
Inconsistent formatting of data can make it difficult to analyze. We can use
string methods to clean and standardize text data.
For example, to standardize the 'email' column to lowercase:
df['email'] = df['email'].str.lower()
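Lowercasing is often combined with trimming stray whitespace, since both cause otherwise identical values to compare as different. A minimal sketch, using made-up email values rather than the real data:

```python
import pandas as pd

# Hypothetical email values with mixed case and stray spaces
emails = pd.Series(["  Alice@Example.COM", "bob@example.com ", None])

# Chain string methods: trim whitespace first, then lowercase.
# Missing entries are left as missing rather than raising an error.
cleaned = emails.str.strip().str.lower()
print(cleaned)
```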
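Handling incorrect data types:
Sometimes a column is stored with the wrong data type, such as numbers stored as strings. The astype() method converts a column to another type, but astype(int) raises an error if the column contains missing values or non-numeric text. pd.to_numeric() with errors='coerce' is safer because it turns entries that cannot be parsed into NaN, which can then be handled as missing values.
For example, to convert the 'age' column from object to a numeric data type:
df['age'] = pd.to_numeric(df['age'], errors='coerce')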
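A direct cast to int fails when the column still holds missing values or text that is not a number. One possible approach, sketched here with made-up age values, is to coerce bad entries to NaN with pd.to_numeric() and then use the nullable "Int64" dtype, which can hold integers alongside missing values:

```python
import pandas as pd

# Hypothetical raw ages as they might arrive from a CSV file
ages_raw = pd.Series(["34", "27", "unknown", None, "41"])

# errors="coerce" turns anything that cannot be parsed into NaN
ages = pd.to_numeric(ages_raw, errors="coerce")

# The nullable Int64 dtype keeps whole numbers and marks gaps as <NA>
ages_int = ages.astype("Int64")
print(ages_int)
```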
Handling missing values:
We can check for missing values in the dataframe using the isnull() method. If there are missing values, we can handle them in several ways, such as filling them with the mean or median value of the column, dropping the rows or columns with missing values, or filling them with a default value.
For example, to fill missing values with the mean value of the column 'age':
mean_age = df['age'].mean()
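# Assign the result back; inplace=True on a selected column may not update df in recent pandas versions
df['age'] = df['age'].fillna(mean_age)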
To drop rows with missing values:
df.dropna(inplace=True)
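The options above can be combined per column instead of being applied to the whole dataframe. The following sketch uses a small made-up dataframe (`df_demo` and its values are assumptions): it counts missing values, fills 'age' with the median, and drops only the rows that lack an email.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps in two columns
df_demo = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "age": [25.0, np.nan, 40.0, 31.0],
    "email": ["a@example.com", "b@example.com", None, "d@example.com"],
})

# Count missing values per column before deciding how to handle them
missing_counts = df_demo.isnull().sum()
print(missing_counts)

# Fill numeric gaps with the median, which is less sensitive to outliers than the mean
df_demo["age"] = df_demo["age"].fillna(df_demo["age"].median())

# Drop only rows missing a value in a column we cannot reasonably fill
df_demo = df_demo.dropna(subset=["email"])
print(df_demo)
```

Restricting dropna() with subset keeps rows that are missing only less important fields.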
Handling duplicates:
We can check for duplicates in the dataframe using the duplicated() method. If there are duplicates, we can handle them by dropping them or keeping only the first occurrence.
For example, to drop duplicates:
df.drop_duplicates(inplace=True)
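It can be useful to count duplicates before dropping them, and to decide which columns define a duplicate. A minimal sketch on made-up rows, assuming that customer_id should be unique:

```python
import pandas as pd

# Hypothetical data in which customer 2 appears twice
df_demo = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "email": ["a@example.com", "b@example.com", "b@example.com", "c@example.com"],
    "purchase_amount": [10.0, 20.0, 20.0, 30.0],
})

# duplicated() marks the second and later copies of a fully identical row
n_dups = int(df_demo.duplicated().sum())
print(n_dups)

# Treat rows as duplicates when customer_id repeats, keeping the first occurrence
deduped = df_demo.drop_duplicates(subset=["customer_id"], keep="first")
print(deduped)
```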
Handling inconsistent data:
We can check for inconsistent data in the dataframe using various methods, such as checking for outliers, incorrect data types, or inconsistent formatting.
For example, to check if the 'gender' column contains inconsistent data:
unique_genders = df['gender'].unique()
print(unique_genders)
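If the output shows inconsistent gender values, we can convert them to a standard format. Using replace() changes only the listed values and leaves everything else untouched, whereas map() would turn any value missing from the dictionary (including values that are already 'Male' or 'Female') into NaN:
df['gender'] = df['gender'].replace({'M': 'Male', 'F': 'Female'})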
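When the column mixes several spellings, one option is to normalize case and whitespace first and then map the result, keeping track of anything the mapping does not cover. This is a sketch on made-up values; the `mapping` dictionary is an assumption about which variants appear in the data.

```python
import pandas as pd

# Hypothetical gender values with inconsistent spelling
gender = pd.Series(["M", "male", " F", "Female", "x", None])

# Assumed set of variants, keyed by their lowercase, trimmed form
mapping = {"m": "Male", "male": "Male", "f": "Female", "female": "Female"}

# Values not found in the mapping become NaN
normalized = gender.str.strip().str.lower().map(mapping)

# List original values that were present but not recognized, for manual review
unmapped = gender[normalized.isna() & gender.notna()]
print(normalized)
print(unmapped)
```

Reviewing `unmapped` before overwriting the column avoids silently discarding unexpected values.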
Removing outliers:
Outliers are extreme values that can skew the analysis. We can identify and
remove outliers using various methods such as box plots, scatter plots, or z-score.
For example, to remove outliers from the 'purchase_amount' column using the interquartile range (IQR) method:
Q1 = df['purchase_amount'].quantile(0.25)
Q3 = df['purchase_amount'].quantile(0.75)
IQR = Q3 - Q1
df = df[(df['purchase_amount'] >= (Q1 - 1.5 * IQR)) & (df['purchase_amount'] <= (Q3 + 1.5 * IQR))]
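The z-score method mentioned above works in a similar way: it measures how many standard deviations each value lies from the mean. The sketch below uses made-up purchase amounts and a cutoff of 3, which is a common convention rather than a fixed rule.

```python
import pandas as pd

# Hypothetical purchase amounts: 20 typical values and one extreme value
amounts = pd.Series([18.0, 19.0, 20.0, 21.0, 22.0] * 4 + [500.0])

# z-score: distance from the mean in units of standard deviation
z = (amounts - amounts.mean()) / amounts.std()

# Keep values within 3 standard deviations of the mean (assumed cutoff)
kept = amounts[z.abs() <= 3]
print(len(amounts), len(kept))
```

In very small samples, a single extreme value inflates the standard deviation so much that it may never cross the cutoff, so the IQR method is often the more robust choice there.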
Data cleaning is an iterative process, and we may need to perform multiple rounds of cleaning to ensure the data is ready for analysis.