Data scientists spend 60-80% of their time cleaning data. Messy, incomplete, inconsistent data is the norm in the real world. This guide covers every common scenario using Pandas — the most essential data cleaning tool in Python.

The Data Cleaning Checklist

Inspect shape, dtypes, and missing values
Handle missing values
Remove duplicates
Fix data types
Handle outliers
Standardise text and categories
Validate business rules

Step 1: Initial Inspection

import pandas as pd
import numpy as np

df = pd.read_csv('data.csv')
print(df.shape, df.dtypes)
print(df.info())
print(df.describe())

missing = pd.DataFrame({
    'count': df.isnull().sum(),
    'pct': (df.isnull().mean() * 100).round(2)
}).sort_values('pct', ascending=False)
print(missing[missing['count'] > 0])

Step 2: Handle Missing Values

df = df.loc[:, df.isnull().mean() < 0.5]  # drop cols > 50% missing

for col in df.select_dtypes(include=np.number).columns:
    df[col] = df[col].fillna(df[col].median())

for col in df.select_dtypes(include='object').columns:
    df[col] = df[col].fillna(df[col].mode()[0] if not df[col].mode().empty else 'Unknown')

print('Nulls remaining:', df.isnull().sum().sum())

Step 3: Duplicates

print('Duplicates:', df.duplicated().sum())
df = df.drop_duplicates()
df = df.drop_duplicates(subset=['customer_id', 'order_date'], keep='last')

Step 4: Fix Data Types

df['revenue'] = pd.to_numeric(df['revenue'].str.replace('[$,]', '', regex=True), errors='coerce')
df['date'] = pd.to_datetime(df['date'], errors='coerce')
df['is_active'] = df['is_active'].astype(bool)
df['category'] = df['category'].astype('category')

Step 5: Outlier Removal

def remove_iqr_outliers(df, col):
    Q1, Q3 = df[col].quantile([0.25, 0.75])
    IQR = Q3 - Q1
    mask = df[col].between(Q1 - 1.5*IQR, Q3 + 1.5*IQR)
    print(f'{col}: removed {(~mask).sum()} outliers')
    return df[mask]

df = remove_iqr_outliers(df, 'salary')

Step 6: Standardise Text

df['city'] = df['city'].str.strip().str.title()
city_aliases = {'Bombay': 'Mumbai', 'Bangalore': 'Bengaluru', 'BLR': 'Bengaluru'}
df['city'] = df['city'].replace(city_aliases)
df['phone'] = df['phone'].str.replace(r'[^\d]', '', regex=True).str[-10:]

FAQ

Drop rows or impute missing values?

Drop when under 5% missing and rows are independent. Impute with median or mode for moderate missingness. Use KNN imputation for important features with high missingness.

How to handle outliers that are real data?

Cap them (winsorising) instead of removing — set values above the 99th percentile to the 99th percentile. This preserves the data point while reducing its disproportionate influence on models.

Data Cleaning in Python: Complete Pandas Guide (2026)

Table of Content

The Data Cleaning Checklist

Step 1: Initial Inspection

Step 2: Handle Missing Values

Step 3: Duplicates

Step 4: Fix Data Types

Step 5: Outlier Removal

Step 6: Standardise Text

FAQ

Drop rows or impute missing values?

How to handle outliers that are real data?

Leave feedback about this Cancel Reply

Latest Posts

SQL Window Functions: Complete Guide with Real Examples (2026)

Time Series Analysis in Python: Complete Guide with Examples (2026)

Natural Language Processing (NLP) with Python: Beginner’s Guide 2026

List of Categories

About us

Categories

The latest

SQL Window Functions: Complete Guide with Real Examples (2026)

Time Series Analysis in Python: Complete Guide with Examples (2026)

Natural Language Processing (NLP) with Python: Beginner’s Guide 2026

Subscribe

Visionaries’ Insights into the Future of Data and Technology

UrbanObserver

Subscribe to newsletter

Data Cleaning in Python: Complete Pandas Guide (2026)

Table of Content

The Data Cleaning Checklist

Step 1: Initial Inspection

Step 2: Handle Missing Values

Step 3: Duplicates

Step 4: Fix Data Types

Step 5: Outlier Removal

Step 6: Standardise Text

FAQ

Drop rows or impute missing values?

How to handle outliers that are real data?

Leave feedback about this Cancel Reply

Latest Posts

SQL Window Functions: Complete Guide with Real Examples (2026)

Time Series Analysis in Python: Complete Guide with Examples (2026)

Natural Language Processing (NLP) with Python: Beginner’s Guide 2026

List of Categories

About us

Categories

The latest

SQL Window Functions: Complete Guide with Real Examples (2026)

Time Series Analysis in Python: Complete Guide with Examples (2026)

Natural Language Processing (NLP) with Python: Beginner’s Guide 2026

Subscribe

Visionaries’ Insights into the Future of Data and Technology